In [28]:
pip install PDFplumber

Collecting PDFplumber
  Downloading pdfplumber-0.6.2-py3-none-any.whl (36 kB)
Collecting pdfminer.six==20220319
  Downloading pdfminer.six-20220319-py3-none-any.whl (5.6 MB)
Collecting Wand>=0.6.7
  Using cached Wand-0.6.7-py2.py3-none-any.whl (139 kB)
Collecting cryptography
  Downloading cryptography-37.0.2-cp36-abi3-win_amd64.whl (2.4 MB)
Collecting chardet
  Downloading chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Installing collected packages: cryptography, chardet, Wand, pdfminer.six, PDFplumber
Successfully installed PDFplumber-0.6.2 Wand-0.6.7 chardet-4.0.0 cryptography-37.0.2 pdfminer.six-20220319
Note: you may need to restart the kernel to use updated packages.




In [213]:
import pandas as pd
import pdfplumber
import re
import os, os.path
from dateutil import parser

## Extracting available data

### Prior data
When the school provided those PDF documents, which were extracted from a free online form builder for school applications, my first action was to take a look in the very first pages of the first document and understand it. I realized that, instead of organized tables with information about the students, all data was stored in form-based PDFs.  It was very look alike a printed document filled by each student separately. The main goal of keeping those documents were just records of students inscriptions, with no strategic business intent for that data.

Therefore, I've proposed a different pipeline for that data, that not only could escalate, enabling larger volumes of data, but also that allowed the school to catalogue, clean, filter, manipulate and analyze all that value informations to find the best business solutions.


In [210]:
pdf_test = pdfplumber.open(".\\data\\211596498612667-0.pdf")
print(len(pdf_test.pages))
print(pdf_test.pages[0].extract_text())
print(pdf_test.pages[1].extract_text())
print(pdf_test.pages[2].extract_text())
print(pdf_test.pages[3].extract_text())

30
Sunday, June 20, 2021
Fiche d'inscription 
Nom et prénom d'élève  Blatt Luce
Date de naissance 10 19 1960
Adresse Rue des vincennes, 9
Toulouse, 31500
E-mail marieluceblatt@gmail.com
Téléphone (0033) 607-103468
Cours:      
Horaire:   
Lundi      12h15 Barre à terre   
Lundi Heure
Mardi    9h Barre à terre   
Mardi Heure
          
Jeudi    9h30 pbt     
Jeudi Heure
Vendredi  10h classique moyen 
Vendredi Heure
      
            
Téléverser le Certi cat Médical
CamScanner 06-20-2021 20.39.pdf
pdf
1
Create your own automated PDFs with Jotform PDF Editor- It’s free
Téléverser le Certi cat d’assurance 
extra-scolaire ou assurance civil
66_CamScanner 06-20-2021 20.39_8995.pdf
pdf
Le paiement du cours sera effectué avec Par 4 cheques
(1-10)
 chèques deXxx  €, au total de     + 30€
(valeur de chaque chèque)
adhésion .
Moins 10€ pour le paiement comptant ou trimestre.
 Découvrez notre planning et nos tarifs a
attitudecorpsetdanses.com/tarifs-et-planning
Pour l’abonnement annuel à Attitude

### Proposing a new way to store data

So, for a better comprehension of the available data, i've tried to create an organized table with all subscriptions information. Regardless any particularity of each student, they all filled the same form for submission, so I could easily identify the fields despite there was no obvious separators between fields.

The inscription form provides lots of information about each student. I created lists to store these informations, according to the fields filled.

In [345]:
def read_pdfs (pdf):
    """open the pdf file and extract all the text information corresponding to each field
    return info"""
    info = []
    for page in range(len(pdf.pages)):
        info.append(pdf.pages[page].extract_text())
    info = "".join(info).replace("Create your own automated PDFs with Jotform PDF Editor- It’s free","").split("Fiche d'inscription")
    for student in range(1,len(info)):
        try:
            nom = info[student].split("Nom et prénom d'élève")[1].split("\n")[0].strip()
        except:
            nom = 0
        try:
            naissance = info[student].split("Date de naissance")[1].split("\n")[0].strip()
        except:
            naissance = 0
        try:
            adresse = info[student].split("Adresse")[1].split("\n")[0].replace("\n"," ").strip()
        except:
            adresse = 0
        try:
            cite = info[student].split("Adresse")[1].split("\n")[1].split(",")[0].replace("\n"," ").strip()
        except:
            cite = 0
        try:
            postal = info[student].split("Adresse")[1].split("\n")[1].split(",")[1].split("\n")[0].replace("\n"," ").strip()
        except:
            postal = 0
        try:
            email = info[student].split("E-mail")[1].split("\n")[0].strip()
        except:
            email=0
        try:
            representant_legal = info[student].split("Représentant légal de l’inscrit (pour")[1].split("les mineurs")[0].strip()
        except:
            representant_legal = 0
        try:
            tel = info[student].split("Téléphone")[1].split("\n")[0].strip()
        except:
            tel = info[student].split("Téléphone")[1].split("\n")[0].strip()
        try:
            cours = info[student].split("Cours:")[1].split("Horaire:")[0].replace("\nCours", "").strip()
        except:
            cours = 0
        try:
            horaire = info[student].split("Horaire:")[1].split("Cours 2")[0].replace("\xa0", "").replace("Heure", "").strip()
            cours2 = info[student].split("Cours 2:")[1].split("Horaire:")[0].replace("\nCours", "").strip()
            try:
                horaire2 = info[student].split("Horaire:")[2].split("Cours 3")[0].replace("\xa0", "").replace("Heure", "").strip()
                cours3 = info[student].split("Cours 3:")[1].split("Horaire:")[0].replace("\nCours", "").strip()
                horaire3 = info[student].split("Horaire:")[3].split("Téléverser")[0].replace("Heure", "").strip()
            except: 
                horaire2= info[student].split("Horaire:")[2].split("Téléverser")[0].replace("\xa0", "").replace("Heure", "").strip()
                cours3 =0
                horaire3=0
        except:
            try:
                horaire = info[student].split("Horaire:")[1].split("Téléverser")[0].replace("\xa0", "").replace("Heure", "").strip()
            except:
                horaire = 0
            cours2 =0
            horaire2=0
            cours3 =0
            horaire3=0
        try:
            adhesion = info[student].split("\xa0\xa0\xa0\xa0\xa0+")[1].split("\n(valeur de chaque chèque)")[0].strip()
        except: 
            adhesion = info[student].split("Please Select")[1].split("adhésion")[0].strip() 
        try:
            paiement_fractionne =info[student].split("Par")[1].split(" cheques")[0].strip()
        except:
            paiement_fractionne =info[student].split("avec")[1].split("chèques")[0].strip()
        paiement_total = info[student].split("au total de")[1].split("€")[0].strip()
        name.append(nom)
        birthday.append(naissance)
        address.append(adresse)
        city.append(cite)
        pcode.append(postal)
        mail.append(email)
        telephone.append(tel)
        legal_representative.append(representant_legal)
        course.append(cours)
        schedule.append(horaire)
        course2.append(cours2)
        schedule2.append(horaire2)
        course3.append(cours3)
        schedule3.append(horaire3)
        registration.append(adhesion)
        installments.append(paiement_fractionne)
        total.append(paiement_total)

All available data was stored in the same folder, and the files' name differ only by the number at the end of them, from 0 to the last. 

In [346]:
name=[]
birthday=[]
address = []
city = []
pcode =[]
mail =[]
telephone =[]
legal_representative =[]
course=[]
schedule =[]
course2 =[]
schedule2 =[]
course3=[]
schedule3 =[]
registration =[]
installments =[]
total =[]
files = os.listdir('C:\\Users\\Tete\\Curso - DA\\Projeto Final\\data') 
for file in range(len(files)-1):
    pdf = pdfplumber.open(f".\\data\\211596498612667-{file}.pdf")
    read_pdfs (pdf)

### Creating the dataframe
After extracting all the information avaialable on those forms, I've gathered them in a dataframe which columns are the fields from the submission form.

In [348]:
attitude = pd.DataFrame(zip(name,birthday, address, city, pcode, mail, telephone, legal_representative, course, schedule, course2, schedule2, course3, schedule3, registration, installments, total))
attitude.columns = ['name','birthday', 'address', 'city', 'pcode','mail', 'telephone', 'legal_representative', 'course', 'schedule', 'course2', 'schedule2', 'course3', 'schedule3', 'registration', 'installments', 'total']

In [369]:
attitude

Unnamed: 0,name,birthday,address,city,pcode,mail,telephone,legal_representative,course,schedule,course2,schedule2,course3,schedule3,registration,installments,total
0,Blatt Luce,19-10-1960,"Rue Des Vincennes, 9",Toulouse,31500,marieluceblatt@gmail.com,(0033) 607-103468,0,,Lundi 12h15 Barre à terre\nLundi \nMardi 9h Ba...,0,0,0,0,30€,4,+ 30
1,Paulon Lily,24-07-2014,"Renée Aspe, 2",Toulouse,31000,do_julia@hotmail.com,(+33) 648-949998,Paulon Julia,Classique,Lundi 17h\nLundi \n\n\n\n\n\n\n1,0,0,0,0,30€,4,500
2,Mezard Emmanuelle,03-02-1970,"Rue Sarah Bernhardt, 44",Toulouse,31200,emmanuelle.mezard@free.fr,(6) 034-64351,0,Barre Moyen,Mardi 19H\nMardi \n\n\n\n\n\n1,Pbt + Ballet Fitness,Jeudi 9H30\nJeudi,Pbt + Ballet Fitness,Vendredi 19h -20h30 \nVendredi,30€,3,880
3,Corbiere Beatrix,20-09-1960,"Jean Gayral, 83",Toulouse,31200,lacabiche@free.fr,(33) 066-3253652,0,,Lundi 20h30\nLundi \n\n\n\n\n\n\n1,,19h,,0,30€,1,720
4,François Eve,11-08-1948,"Avenue Winston Churchill, 10",Toulouse,31100,francoiseve9@gmail.com,(+33) 761-113634,0,Barre Au Sol,Lundi 12h15\nLundi \n\n\n\n\n\n\n1,Barre,Vendredi 19h\nVendredi,Barre,0,30€,3,720
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Peccia-Galletto Sasha,17-06-2016,"Monplaisir, 40",Toulouse,31400,ln.seguela@gmail.com,(06) 249-71201,Séguéla Hélène,Débutant,Mardi 17H\nMardi,0,0,0,0,30€,3,500
96,Galvani Francoise,04-09-1959,6 Rue Pierre De Fermat,Toulouse,31000,francoisegalvani@icloud.com,(0033) 609-380633,0,0,0,0,0,0,0,30€ adhésion .,1,+ 30
97,Lemozit Sasha,05-10-2007,"Alleee Des Demoiselles, 6",Toulouse,31400,christophelemozit@yahoo.fr,(33) 682-463223,Lemozit Christophe,Moderne,Vendredi 18h\nVendredi,0,0,0,0,30€,3,500
98,Blanc Bouny Celeste,13-07-2013,"Rue Galilee, 9",Toulouse,31500,clairebouny0412@yahoo.fr,(06) 811-70151,Bouny Claire,Danse Classique Debutant,Lundi 17H\nLundi,0,0,0,0,30€,3,500


## Transforming data   

### Checking columns:

In [350]:

#Dropping duplicates
attitude = attitude.drop_duplicates()

In [358]:
#Let's standardize it!
# Strings
attitude.name = [nom.title() for nom in attitude.name]
attitude.address = [adresse.title() for adresse in attitude.address]
attitude.city = [cite.title() for cite in attitude.city]
attitude.mail = [email.lower() for email in attitude.mail]
attitude.legal_representative = [representant.title() if representant != 0 else 0 for representant in attitude.legal_representative]
attitude.course = [cours.title() if cours != 0 else 0 for cours in attitude.course]
attitude.course2= [cours2.title() if cours2 != 0 else 0 for cours2 in attitude.course2]
attitude.course3= [cours3.title() if cours3 != 0 else 0 for cours3 in attitude.course2]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [1493743463.py:3]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [1493743463.py:4]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [1493743463.py:5]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentat

In [352]:
#Birthday column
issues = []
for naissance in attitude.birthday:
    try:
        date = parser.parse(naissance)
    except:
        issues.append(naissance)
print(issues)
naissances=[]
for naissance in attitude.birthday:
    if naissance != 0:
        date = parser.parse(naissance.replace('25 nivelbre 1947', '25-11-1947').replace('18 décembre 2012', '18-12-2012').replace('17 août 2008', '17-08-2008').replace('4 juin 1975', '04-06-1975').replace('1er mai 2014', '01-05-2014').replace('18 AOUT 2010', '18-08-2010').replace('23 JANVIER 2017', '23-01-2017').replace('30061986', '30-06-1986').replace('17 juin 2016', '17-06-2016'))
        naissances.append(date.strftime('%d-%m-%Y'))
    else:
        naissances.append(0)
attitude.birthday = naissances

['25 nivelbre 1947', '18 décembre 2012', '17 août 2008', '4 juin 1975', '1er mai 2014', 0, '18 AOUT 2010', '23 JANVIER 2017', 0, '30061986', '17 juin 2016']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [1156981790.py:16]


In [356]:
#Address column:
attitude[attitude['address'] == '44, Rue Sarah Bernhardt, 44, Rue Sarah Bernhardt'] = attitude[attitude['address'] == '44, Rue Sarah Bernhardt, 44, Rue Sarah Bernhardt'].replace('44, Rue Sarah Bernhardt, 44, Rue Sarah Bernhardt', 'Rue Sarah Bernhardt, 44')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [3625858055.py:2]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [3625858055.py:2]


In [378]:
#Column Telephone:
for row in range(len(attitude.telephone)):
    attitude['telephone'].iloc[row] = attitude['telephone'].iloc[row].replace('(33)', '+33').replace('(0033)', '+33').replace('(033)', '+33').replace('(','').replace("(0687026502)", "0687026502").replace("(Portable)", "").replace("(0634121580) 063-4121580", "063-4121580").replace("(0687026502) 068-7026502", "068-7026502").replace("(0689289829) ¨", "0689289829").replace("(0689289829) 068-9289829", "0689289829").replace("(0607991130) 060-7991130", "0607991130").replace("(0683365627) 068-3365627", "0683365627").replace(')',"").replace("-","")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [2884996022.py:3]


In [410]:

for row in range(len(attitude.schedule)):
    if attitude['schedule'].iloc[row] != 0:
        try:
            attitude['schedule'].iloc[row] = attitude['schedule'].iloc[row].replace('\n1',' ').replace('\n', "").strip().split(" ")[0] + ", " + attitude['schedule'].iloc[row].replace('\n',' ').replace('1','').strip().split(" ")[1]
        except:
            continue
    else:
        continue

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [2969231407.py:4]


In [425]:
attitude.schedule = schedule

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [810273013.py:1]


In [391]:
attitude[attitude['course'] == ""]
attitude['course'].iloc[0] = "Barre à Terre"
attitude['schedule'].iloc[0] = "Mardi, 12h15"
attitude['course2'].iloc[0] = 'Pbt'
attitude['schedule2'].iloc[0] = "Jeudi, 9h30"
attitude['course3'].iloc[0] = 'Classique moyen'
attitude['schedule3'].iloc[0] = "Vendredi, 10h"




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [113183060.py:1]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [113183060.py:2]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [113183060.py:3]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [113183060.py:4]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: htt

In [None]:
attitude_eleves = attitude.drop(columns=['course', 'schedule', 'course2', 'schedule2',
       'course3', 'schedule3', 'registration', 'installments', 'total'])

In [None]:
cols = ["Nom et prénom d'élève", "Date de naissance", "Adresse", "E-mail", "Téléphone", "Représentant légal de l’inscrit (pour les mineurs)", "Cours", "Horaire",  "Cours 2", "Horaire 2",  "Cours 3", "Horaire 3", "Adhésion", "Paiement fractionné", "Paiement Total"]
attitude.columns = cols