In [3]:
pip install PDFplumber

^C
Note: you may need to restart the kernel to use updated packages.


In [97]:
import pandas as pd
import pdfplumber
import re
import os, os.path
from dateutil import parser
from sshtunnel import SSHTunnelForwarder
import sqlalchemy as db
from sqlalchemy import create_engine

## Extracting available data

### Prior data
When the school provided those PDF documents, which were extracted from a free online form builder for school applications, my first action was to take a look in the very first pages of the first document and understand it. I realized that, instead of organized tables with information about the students, all data was stored in form-based PDFs.  It was very look alike a printed document filled by each student separately. The main goal of keeping those documents were just records of students inscriptions, with no strategic business intent for that data.

Therefore, I've proposed a different pipeline for that data, that not only could escalate, enabling larger volumes of data, but also that allowed the school to catalogue, clean, filter, manipulate and analyze all that value informations to find the best business solutions.


In [9]:
pdf_test = pdfplumber.open(".\\data\\211596498612667-0.pdf")
print(len(pdf_test.pages))
print(pdf_test.pages[0].extract_text())
print(pdf_test.pages[1].extract_text())
print(pdf_test.pages[2].extract_text())
print(pdf_test.pages[3].extract_text())

30
Sunday, June 20, 2021
Fiche d'inscription 
Nom et prénom d'élève  Blatt Luce
Date de naissance 10 19 1960
Adresse Rue des vincennes, 9
Toulouse, 31500
E-mail marieluceblatt@gmail.com
Téléphone (0033) 607-103468
Cours:      
Horaire:   
Lundi      12h15 Barre à terre   
Lundi Heure
Mardi    9h Barre à terre   
Mardi Heure
          
Jeudi    9h30 pbt     
Jeudi Heure
Vendredi  10h classique moyen 
Vendredi Heure
      
            
Téléverser le Certi cat Médical
CamScanner 06-20-2021 20.39.pdf
pdf
1
Create your own automated PDFs with Jotform PDF Editor- It’s free
Téléverser le Certi cat d’assurance 
extra-scolaire ou assurance civil
66_CamScanner 06-20-2021 20.39_8995.pdf
pdf
Le paiement du cours sera effectué avec Par 4 cheques
(1-10)
 chèques deXxx  €, au total de     + 30€
(valeur de chaque chèque)
adhésion .
Moins 10€ pour le paiement comptant ou trimestre.
 Découvrez notre planning et nos tarifs a
attitudecorpsetdanses.com/tarifs-et-planning
Pour l’abonnement annuel à Attitude

### Proposing a new way to store data

So, for a better comprehension of the available data, i've tried to create an organized table with all subscriptions information. Regardless any particularity of each student, they all filled the same form for submission, so I could easily identify the fields despite there was no obvious separators between fields.

The inscription form provides lots of information about each student. I created lists to store these informations, according to the fields filled.

In [10]:
def read_pdfs (pdf):
    """open the pdf file and extract all the text information corresponding to each field
    return info"""
    info = []
    for page in range(len(pdf.pages)):
        info.append(pdf.pages[page].extract_text())
    info = "".join(info).replace("Create your own automated PDFs with Jotform PDF Editor- It’s free","").split("Fiche d'inscription")
    for student in range(1,len(info)):
        try:
            nom = info[student].split("Nom et prénom d'élève")[1].split("\n")[0].strip()
        except:
            nom = 0
        try:
            naissance = info[student].split("Date de naissance")[1].split("\n")[0].strip()
        except:
            naissance = 0
        try:
            adresse = info[student].split("Adresse")[1].split("\n")[0].replace("\n"," ").strip()
        except:
            adresse = 0
        try:
            cite = info[student].split("Adresse")[1].split("\n")[1].split(",")[0].replace("\n"," ").strip()
        except:
            cite = 0
        try:
            postal = info[student].split("Adresse")[1].split("\n")[1].split(",")[1].split("\n")[0].replace("\n"," ").strip()
        except:
            postal = 0
        try:
            email = info[student].split("E-mail")[1].split("\n")[0].strip()
        except:
            email=0
        try:
            representant_legal = info[student].split("Représentant légal de l’inscrit (pour")[1].split("les mineurs")[0].strip()
        except:
            representant_legal = 0
        try:
            tel = info[student].split("Téléphone")[1].split("\n")[0].strip()
        except:
            tel = info[student].split("Téléphone")[1].split("\n")[0].strip()
        try:
            cours = info[student].split("Cours:")[1].split("Horaire:")[0].replace("\nCours", "").strip()
        except:
            cours = 0
        try:
            horaire = info[student].split("Horaire:")[1].split("Cours 2")[0].replace("\xa0", "").replace("Heure", "").strip()
            cours2 = info[student].split("Cours 2:")[1].split("Horaire:")[0].replace("\nCours", "").strip()
            try:
                horaire2 = info[student].split("Horaire:")[2].split("Cours 3")[0].replace("\xa0", "").replace("Heure", "").strip()
                cours3 = info[student].split("Cours 3:")[1].split("Horaire:")[0].replace("\nCours", "").strip()
                horaire3 = info[student].split("Horaire:")[3].split("Téléverser")[0].replace("Heure", "").strip()
            except: 
                horaire2= info[student].split("Horaire:")[2].split("Téléverser")[0].replace("\xa0", "").replace("Heure", "").strip()
                cours3 =0
                horaire3=0
        except:
            try:
                horaire = info[student].split("Horaire:")[1].split("Téléverser")[0].replace("\xa0", "").replace("Heure", "").strip()
            except:
                horaire = 0
            cours2 =0
            horaire2=0
            cours3 =0
            horaire3=0
        try:
            adhesion = info[student].split("\xa0\xa0\xa0\xa0\xa0+")[1].split("\n(valeur de chaque chèque)")[0].strip()
        except: 
            adhesion = info[student].split("Please Select")[1].split("adhésion")[0].strip() 
        try:
            paiement_fractionne =info[student].split("Par")[1].split(" cheques")[0].strip()
        except:
            paiement_fractionne =info[student].split("avec")[1].split("chèques")[0].strip()
        paiement_total = info[student].split("au total de")[1].split("€")[0].strip()
        name.append(nom)
        birthday.append(naissance)
        address.append(adresse)
        city.append(cite)
        pcode.append(postal)
        mail.append(email)
        telephone.append(tel)
        legal_representative.append(representant_legal)
        course.append(cours)
        schedule.append(horaire)
        course2.append(cours2)
        schedule2.append(horaire2)
        course3.append(cours3)
        schedule3.append(horaire3)
        registration.append(adhesion)
        installments.append(paiement_fractionne)
        total.append(paiement_total)

All available data was stored in the same folder, and the files' name differ only by the number at the end of them, from 0 to the last. 

In [11]:
name=[]
birthday=[]
address = []
city = []
pcode =[]
mail =[]
telephone =[]
legal_representative =[]
course=[]
schedule =[]
course2 =[]
schedule2 =[]
course3=[]
schedule3 =[]
registration =[]
installments =[]
total =[]
files = os.listdir('C:\\Users\\Tete\\Curso - DA\\Projeto Final\\data') 
for file in range(len(files)-1):
    pdf = pdfplumber.open(f".\\data\\211596498612667-{file}.pdf")
    read_pdfs (pdf)

### Creating the dataframe
After extracting all the information avaialable on those forms, I've gathered them in a dataframe which columns are the fields from the submission form.

In [12]:
attitude = pd.DataFrame(zip(name,birthday, address, city, pcode, mail, telephone, legal_representative, course, schedule, course2, schedule2, course3, schedule3, registration, installments, total))
attitude.columns = ['name','birthday', 'address', 'city', 'pcode','mail', 'telephone', 'legal_representative', 'course', 'schedule', 'course2', 'schedule2', 'course3', 'schedule3', 'registration', 'installments', 'total']

In [13]:
attitude

Unnamed: 0,name,birthday,address,city,pcode,mail,telephone,legal_representative,course,schedule,course2,schedule2,course3,schedule3,registration,installments,total
0,Blatt Luce,10 19 1960,"Rue des vincennes, 9",Toulouse,31500,marieluceblatt@gmail.com,(0033) 607-103468,0,,Lundi 12h15 Barre à terre\nLundi \nMardi 9h Ba...,0,0,0,0,30€,4,+ 30
1,Paulon Lily,24/07/14,"Renée Aspe, 2",Toulouse,31000,do_julia@hotmail.com,(+33) 648-949998,Paulon Julia,Classique,Lundi 17h\nLundi \n\n\n\n\n\n\n1,0,0,0,0,30€,4,500
2,MEZARD Emmanuelle,02/03/1970,"44, rue Sarah Bernhardt, 44, rue Sarah Bernhardt",Toulouse,31200,emmanuelle.mezard@free.fr,(6) 034-64351,0,BARRE MOYEN,Mardi 19H\nMardi \n\n\n\n\n\n1,PBT + BALLET FITNESS,Jeudi 9H30\nJeudi,BARRE MOYEN ( + pointes ),Vendredi 19h -20h30 \nVendredi,30€,3,880
3,CORBIERE BEATRIX,20/09/1960,"Jean GAYRAL, 83",TOULOUSE,31200,lacabiche@free.fr,(33) 066-3253652,0,,Lundi 20h30\nLundi \n\n\n\n\n\n\n1,,19h,0,0,30€,1,720
4,François Eve,08/11/1948,"Avenue Winston Churchill, 10",Toulouse,31100,francoiseve9@gmail.com,(+33) 761-113634,0,Barre au sol,Lundi 12h15\nLundi \n\n\n\n\n\n\n1,Barre,Vendredi 19h\nVendredi,0,0,30€,3,720
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Peccia-Galletto Sasha,17 juin 2016,"Monplaisir, 40",Toulouse,31400,ln.seguela@gmail.com,(06) 249-71201,Séguéla Hélène,débutant,Mardi 17H\nMardi,0,0,0,0,30€,3,500
96,Galvani Francoise,09/04/1959,6 rue Pierre de Fermat,Toulouse,31000,francoisegalvani@icloud.com,(0033) 609-380633,0,0,0,0,0,0,0,30€ adhésion .,1,+ 30
97,lemozit sasha,10/05/2007,"alleee des demoiselles, 6",toulouse,31400,christophelemozit@yahoo.fr,(33) 682-463223,lemozit christophe,Moderne,Vendredi 18h\nVendredi,0,0,0,0,30€,3,500
98,BLANC BOUNY CELESTE,13/07/2013,"RUE GALILEE, 9",TOULOUSE,31500,clairebouny0412@yahoo.fr,(06) 811-70151,BOUNY CLAIRE,DANSE CLASSIQUE DEBUTANT,Lundi 17H\nLundi,0,0,0,0,30€,3,500


## Transforming data   

### Checking columns:

In [14]:
#Dropping duplicates
attitude = attitude.drop_duplicates().reset_index(drop=True)

In [15]:
#Let's standardize it!
# Strings
attitude.name = [nom.title() for nom in attitude.name]
attitude.address = [adresse.title() for adresse in attitude.address]
attitude.city = [cite.title() for cite in attitude.city]
attitude.mail = [email.lower() for email in attitude.mail]
attitude.legal_representative = [representant.title() if representant != 0 else 0 for representant in attitude.legal_representative]
attitude.course = [cours.title() if cours != 0 else 0 for cours in attitude.course]
attitude.course2= [cours2.title() if cours2 != 0 else 0 for cours2 in attitude.course2]
attitude.course3= [cours3.title() if cours3 != 0 else 0 for cours3 in attitude.course2]

In [16]:
#Birthday column
issues = []
for naissance in attitude.birthday:
    try:
        date = parser.parse(naissance)
    except:
        issues.append(naissance)
print(issues)
naissances=[]
for naissance in attitude.birthday:
    if naissance != 0:
        date = parser.parse(naissance.replace('25 nivelbre 1947', '25-11-1947').replace('18 décembre 2012', '18-12-2012').replace('17 août 2008', '17-08-2008').replace('4 juin 1975', '04-06-1975').replace('1er mai 2014', '01-05-2014').replace('18 AOUT 2010', '18-08-2010').replace('23 JANVIER 2017', '23-01-2017').replace('30061986', '30-06-1986').replace('17 juin 2016', '17-06-2016'))
        naissances.append(date.strftime('%d-%m-%Y'))
    else:
        naissances.append(0)
attitude.birthday = naissances

['25 nivelbre 1947', '18 décembre 2012', '17 août 2008', '4 juin 1975', '1er mai 2014', 0, '18 AOUT 2010', '23 JANVIER 2017', 0, '30061986', '17 juin 2016']


In [17]:
#Address column:
attitude['address'].iloc[2] = 'Rue Sarah Bernhardt, 44'
attitude['address'].iloc[1] = '11 Rue Du Docteur Charles Bonneau'


In [18]:
#Column Telephone:
for row in range(len(attitude.telephone)):
    attitude['telephone'].iloc[row] = attitude['telephone'].iloc[row].replace('(33)', '+33').replace('(0033)', '+33').replace('(033)', '+33').replace('(','').replace("(0687026502)", "0687026502").replace("(Portable)", "").replace("(0634121580) 063-4121580", "063-4121580").replace("(0687026502) 068-7026502", "068-7026502").replace("(0689289829) ¨", "0689289829").replace("(0689289829) 068-9289829", "0689289829").replace("(0607991130) 060-7991130", "0607991130").replace("(0683365627) 068-3365627", "0683365627").replace("0689289829 ¨", '0689289829').replace('.',"").replace(')',"").replace("-","")

In [19]:
def schedules (schedule_num):
    issues_horaire =[]
    for row in range(len(attitude[schedule_num])):
        if attitude[schedule_num].iloc[row] != 0:
            try:
                horaire = attitude[schedule_num].iloc[row].replace('\n1',' ').strip().split('\n')
                if len(horaire) == 2:
                    attitude[schedule_num].iloc[row] = attitude[schedule_num].iloc[row].replace('\n1',' ').strip().split('\n')[0].lower()
                else:
                    issues_horaire.append(row)
            except:
                continue
        else:
            continue
    return issues_horaire

def schedule_course (schedule_num, course_num):
    for row in range(len(attitude[schedule_num])):
        if attitude[schedule_num].iloc[row] == 'mardi 19h' or attitude[schedule_num].iloc[row] == 'mardi 18h' or attitude[schedule_num].iloc[row] =='vendredi 19h' or attitude[schedule_num].iloc[row] =='lundi 10h15':
            attitude[course_num].iloc[row] = 'Classique Moyen'
        elif attitude[schedule_num].iloc[row] == 'lundi 12h15' or attitude[schedule_num].iloc[row] == 'mardi 9h' or attitude[schedule_num].iloc[row] == 'samedi 12h15' or attitude[schedule_num].iloc[row] == 'samedi 12h' or attitude[schedule_num].iloc[row] == 'mardi 10h':
            attitude[course_num].iloc[row] = 'Barre à Terre'
        elif attitude[schedule_num].iloc[row] == 'mercredi 14h15':
            attitude[course_num].iloc[row] == 'Classique 1'
        elif attitude[schedule_num].iloc[row] == 'lundi 17h':
            attitude[course_num].iloc[row] = 'Préparatoire'    
        elif attitude[schedule_num].iloc[row] == 'mercredi 16h30' or attitude[schedule_num].iloc[row] == 'vendredi 20h30':
            attitude[course_num].iloc[row] = 'Pointes'
        elif attitude[schedule_num].iloc[row] == 'mardi 17h':
            attitude[course_num].iloc[row] = 'Éveil'
        elif attitude[schedule_num].iloc[row] == 'jeudi 17h10' or attitude[schedule_num].iloc[row] == 'jeudi 17h15':
            attitude[course_num].iloc[row] = 'Initiation'
        elif attitude[schedule_num].iloc[row] == 'mercredi 17h45':
            attitude[course_num].iloc[row] = 'Classique 2'
        elif attitude[schedule_num].iloc[row] == 'mercredi 14h25' or attitude[schedule_num].iloc[row] =='mercredi 14h15' or attitude[schedule_num].iloc[row] =='mercredi 13h15':
            attitude[course_num].iloc[row] = 'Classique 1'
        elif attitude[schedule_num].iloc[row] == 'lundi 18h':
            attitude[course_num].iloc[row] = 'Contemporain'
        elif attitude[schedule_num].iloc[row] == 'jeudi 20h' or attitude[schedule_num].iloc[row] =='lundi 20h30' or attitude[schedule_num].iloc[row] =='mercredi 9h' or attitude[schedule_num].iloc[row] =='jeudi 19h30':
            attitude[course_num].iloc[row] = 'Pilates'
        elif attitude[schedule_num].iloc[row] == 'lundi 10h':
            attitude[course_num].iloc[row] = 'Classique Moyen'
        elif attitude[schedule_num].iloc[row] == 'lundi 18h' or attitude[schedule_num].iloc[row] == 'mercredi 15h30':
            attitude[course_num].iloc[row] = 'Pbt'
        elif attitude[schedule_num].iloc[row] == 'jeudi 18h30' or attitude[schedule_num].iloc[row] == 'lundi 19h' or attitude[schedule_num].iloc[row] =='mercredi 19h30' or attitude[schedule_num].iloc[row] == 'samedi 10h40' or attitude[schedule_num].iloc[row] == 'samedi 10h30' or attitude[schedule_num].iloc[row] == 'lundi 19h15' or attitude[schedule_num].iloc[row] == 'mardi 19h30' or attitude[schedule_num].iloc[row] == 'mardi 19h15' or attitude[schedule_num].iloc[row] == 'jeudi 18h30':
            attitude[course_num].iloc[row] = 'Classique Interm. – Avancé'
        elif attitude[schedule_num].iloc[row] == 'vendredi 10h15' or attitude[schedule_num].iloc[row] =='vendredi 10h' or attitude[schedule_num].iloc[row] =='vendredi 19h' or attitude[schedule_num].iloc[row] =='lundi 10h' or attitude[schedule_num].iloc[row] =='lundi 19h15':
            attitude[course_num].iloc[row] = 'Classique Moyen'
        elif attitude[schedule_num].iloc[row] == 'lundi 19h30':
            attitude[course_num].iloc[row] = 'Classique Avancé'
        elif attitude[schedule_num].iloc[row] == 'vendredi 18h':
            attitude[course_num].iloc[row] = 'Moderne'
        elif attitude[schedule_num].iloc[row] == 'jeudi 9h30':
            attitude[course_num].iloc[row] = 'Pbt + Ballet Fitness'    

In [20]:
issues_horaire = schedules ('schedule')
issues_horaire

[0, 5, 11, 20, 25, 33, 58, 72, 89]

In [21]:
attitude['schedule'].iloc[0] = "mardi 12h15"
attitude['schedule2'].iloc[0] = "jeudi 9h30"
attitude['schedule'].iloc[5] = "Lundi, 12h15"
attitude['schedule2'].iloc[5] = "mardi 18h  vendredi 19h"
attitude['course'].iloc[11] = "barre à rerre"
attitude['schedule'].iloc[11] = "lundi 12h15 mardi 9h"
attitude['schedule'].iloc[20] = "lundi 19h15"
attitude['schedule2'].iloc[20] = "jeudi 18h30 vendredi 20h30"
attitude['schedule'].iloc[25] = "lundi 12h15"
attitude['schedule2'].iloc[25] = "vendredi 19h"
attitude['schedule'].iloc[28] ='mercredi 15h30'
attitude['schedule'].iloc[33] = "mardi 9h"
attitude['schedule2'].iloc[33] = "vendredi 10h15"
attitude['course'].iloc[58] = "classique avancé"
attitude['schedule'].iloc[58] = "mardi 19h15 jeudi 18h30"
attitude['course2'].iloc[58] = "pbt"
attitude['schedule2'].iloc[58] = "mercredi 15h30"

In [22]:
attitude.schedule.unique()
for row in range(len(attitude.schedule)):
    if attitude['schedule'].iloc[row] != 0:
        attitude['schedule'].iloc[row] = attitude['schedule'].iloc[row].lower().replace(',', '').replace(':', 'h').replace('h00', 'h').replace(' h ', 'h').replace('vendredi 19h-20h30', 'vendredi 19h').replace('mardi 18h/19h30', 'mardi 18h').replace('mardi 18h-19h30', 'mardi 18h').replace('mercredi 16h30/17h30','mercredi 16h30').replace('mercredi 17h45-19h15','mercredi 17h45').replace('mardi 10 ', 'mardi 10h').replace('jeudi 17h15 - 18h15','jeudi 17h15').replace('lundi 17 ', 'lundi 17h').replace('jeudi 17h15-18h15', 'jeudi 17h15').replace('mercredi 14h15-15h30','mercredi 14h15').replace('mardi 17h - 17h45', 'mardi 17h').replace('jeudi 19h30-21h', 'jeudi 19h30').replace('lundi 17-18h', 'lundi 17h').replace('jeudi 20h - 21h', 'jeudi 20h').replace('lundi 10h11h30', 'lundi 10h').replace('lundi 18 ','lundi 18h').replace('lundi 19h 15 et 20h45', 'lundi 19h15').replace('mercredi 16.30/17.30', 'mercredi 16h30').replace('mardi 9.00/10.00', 'mardi 9h').replace('jeudi 18h30 - 20h', 'jeudi 18h30').replace('mardi 19h30 - 20h45','mardi 19h30').replace('10h30\n\n\n\n\n\nsamedi \nsamedi\n\n1', 'samedi 10h30').replace('mardi 12h15','lundi 12h15').replace('mardi 17 ','mardi 17h').replace('mercredi 9 ', 'mercredi 9h')
attitude.schedule.unique()

array(['lundi 12h15', 'lundi 17h', 'mardi 19h', 'lundi 20h30',
       'mercredi 14h15', 'mardi 17h', 'mardi 18h', 'samedi 12h',
       'lundi 12h15 mardi 9h', 'vendredi 19h', 'mercredi 16h30',
       'lundi 19h15', 'jeudi 17h10', 'mercredi 17h45', 'mardi 19h30',
       'mercredi 15h30', 'lundi 10h15', 'samedi 10h30', 'jeudi 17h15',
       'mardi 10h', 'mardi 9h', 'mercredi ', 'lundi 18h', 'jeudi 19h30',
       0, 'jeudi 20h', 'mardi 9h ', 'mercredi 13h15', 'lundi 10h',
       'mardi 19h15 jeudi 18h30', 'jeudi 18h30', 'mercredi 19h30',
       'vendredi 10h15', 'samedi ', 'mercredi 9h', 'jeudi ', '', 'lundi ',
       'vendredi 18h'], dtype=object)

In [23]:
issues_horaire2 = schedules ('schedule2')
issues_horaire2

[0, 3, 5, 19, 20, 25, 33, 58]

In [24]:
attitude.schedule2.unique()
for row in range(len(attitude.schedule2)):
    if attitude['schedule2'].iloc[row] != 0:
        attitude['schedule2'].iloc[row] = attitude['schedule2'].iloc[row].replace(',', '').replace(':', 'h').replace('h00', 'h').replace(' h ', 'h').replace('Jeudi, 9h30', 'jeudi 9h30').replace('Mardi, 18h,  Vendredi, 19h', 'mardi 18h,  vendredi 19h').replace('vendredi 20h30-21h15','vendredi 20h30').replace('vendredi 18/19h','vendredi 18h').replace('mercredi 19h30-21h', 'mercredi 19h30').replace('Mercredi 17h45/19h15\nMercredi \n\n\n\n\n2','Mercredi 17h45').replace('Jeudi, 18h30, Vendredi, 20h30','jeudi 18h30, vendredi 20h30').replace('mercredi 17.45/19.15','mercredi 17h45').replace('mercredi 15h30 - 16h30', 'mercredi 15h30').lower()
attitude.schedule2.unique()

array(['jeudi 9h30', 0, '19h', 'vendredi 19h', 'mardi 18h  vendredi 19h',
       'mercredi 15h30', 'vendredi 20h30', 'samedi 12h15', 'vendredi 18h',
       'mercredi 19h30', 'mercredi 17h45', 'jeudi 18h30 vendredi 20h30',
       'samedi 10h30', 'mercredi 16h30', 'vendredi 10h15', 'lundi 20h45',
       'samedi 10h40', 'lundi 19h15', 'jeudi ', 'mercredi ', 'vendredi '],
      dtype=object)

In [25]:
issues_horaire3 = schedules ('schedule3')
issues_horaire3

[18, 20, 57, 58, 92]

In [26]:
attitude['schedule3'].iloc[18] = 'vendredi 20h30'
attitude['schedule3'].iloc[20] = 'vendredi 20h30'
attitude['schedule3'].iloc[57] = 'vendredi 10h'
attitude['schedule3'].iloc[58] = 'mercredi 17h45'
attitude['schedule3'].iloc[92] = 'samedi 10h30'

In [27]:
attitude.schedule3.unique()
for row in range(len(attitude.schedule3)):
    if attitude['schedule3'].iloc[row] != 0:
        attitude['schedule3'].iloc[row] = attitude['schedule3'].iloc[row].lower().replace(',', '').replace(':', 'h').replace('h00', 'h').replace(' h ', 'h').replace('vendredi \xa019h -20h30\xa0', 'vendredi 19h').replace('vendredi \xa019/20h30\xa0','vendredi 19h').replace('mercredi \xa017h45\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0', 'mercredi 17h45').replace('samedi \xa0\xa0\xa010h30\xa0\xa0\xa0', 'samedi 10h30').replace('\xa0', '')
attitude.schedule3.unique()

array([0, 'vendredi 19h', 'vendredi 20h30', 'mercredi 17h45',
       'samedi 10h30', 'vendredi 10h', 'mercredi ', 'vendredi '],
      dtype=object)

In [28]:
schedule_course ('schedule', 'course')
schedule_course ('schedule2', 'course2')
schedule_course ('schedule3', 'course3')

In [29]:
attitude[attitude['course']==0]
attitude['course'].iloc[39] = "pilates"
attitude['schedule'].iloc[39] = "juedi 20h"
attitude['course'].iloc[41] = 'classique 1'
attitude['course'].iloc[51] = 'classique 1'
attitude['course'].iloc[78] = 'classique 1'
attitude['course'].iloc[79] = 'classique moyen'
attitude['schedule'].iloc[79] = "vendredi 10h"
attitude['course'].iloc[79] = 'classique interm. – avancé'
attitude['schedule'].iloc[93] = "vendredi 10h"
attitude['course'].iloc[93] = 'classique interm. – avancé'
attitude['course'].iloc[46] = "pilates"
attitude['schedule'].iloc[46] = "lundi 9h mercredi 9h vendredi 9h"
attitude['course'].iloc[46] = "classique moyen"
attitude['schedule'].iloc[46] = "lundi 10h"
attitude['course'].iloc[46] = "pbt + ballet fitness"
attitude['schedule'].iloc[46] = "juedi 9h30"
attitude['course'].iloc[56] = "pilates"
attitude['schedule'].iloc[56] = "mercredi 9h vendredi 9h"
attitude['course'].iloc[66] = "pilates"
attitude['schedule'].iloc[66] = "lundi 9h"
attitude['course'].iloc[83] = "pilates"
attitude['schedule'].iloc[83] = "lundi 20h45"
attitude['course2'].iloc[83] = 0
attitude['schedule2'].iloc[83] = 0
attitude['course'].iloc[95] = "barre à terre"
attitude['schedule'].iloc[95] = "mardi 9h"

In [30]:
attitude.course.unique()
for row in range(len(attitude.course)):
    if attitude['course'].iloc[row] != 0:
        attitude['course'].iloc[row] = attitude['course'].iloc[row].replace("Classiquee Avancé", "Classique Avancé").replace('barre à rerre', 'barre à terre').replace('inter/ avance', 'classique interm. – avancé').lower()
        if attitude['course'].iloc[row] =='classique interm':
            attitude['course'].iloc[row] = ('classique interm. – avancé')
        if attitude['course'].iloc[row] =='1' or attitude['course'].iloc[row] =='classique':
            attitude['course'].iloc[row] = ("classique 1")
    elif attitude['course'].iloc[row] == "":
        attitude['course'].iloc[row] == 0
attitude.course.unique()

array(['barre à terre', 'préparatoire', 'classique moyen', 'pilates',
       'classique 1', 'éveil', 'pointes', 'classique interm. – avancé',
       'initiation', 'classique 2', 'pbt', 'contemporain',
       'pbt + ballet fitness', 'classique avancé', 'carte 10 cours',
       'moderne'], dtype=object)

In [32]:
attitude.course2.unique()
for row in range(len(attitude.course2)):
    if attitude['course2'].iloc[row] != 0:
        attitude['course2'].iloc[row] = attitude['course2'].iloc[row].replace("Classiquee Avancé", "Classique Avancé").replace('barre à rerre', 'barre à terre').replace('inter/ avance', 'classique interm. – avancé').lower()
        if attitude['course2'].iloc[row] =='barre':
            attitude['course2'].iloc[row] = ("barre à terre")
        if attitude['course2'].iloc[row] =='classique interm':
            attitude['course2'].iloc[row] = ('classique interm. – avancé')
        if attitude['course2'].iloc[row] =='classique':
            attitude['course2'].iloc[row] = ("classique 1")
    elif attitude['course2'].iloc[row] == "":
        attitude['course2'].iloc[row] == 0
attitude.course2.unique()

array(['pbt + ballet fitness', 0, '', 'classique moyen', 'pbt', 'pointes',
       'barre à terre', 'moderne', 'classique interm. – avancé',
       'classique 2'], dtype=object)

In [33]:
attitude.course3.unique()
for row in range(len(attitude.course3)):
    if attitude['course3'].iloc[row] != 0:
        attitude['course3'].iloc[row] = attitude['course3'].iloc[row].replace("Classiquee Avancé", "Classique Avancé").replace('barre à rerre', 'barre à terre').replace('inter/ avance', 'classique interm. – avancé').lower()
        if attitude['course3'].iloc[row] =='barre':
            attitude['course3'].iloc[row] = ("barre à terre")
        if attitude['course3'].iloc[row] =='classique':
            attitude['course3'].iloc[row] = ("classique 1")
    elif attitude['course3'].iloc[row] == "":
        attitude['course3'].iloc[row] == 0
attitude.course3.unique()

array([0, 'classique moyen', '', 'barre à terre', 'pbt', 'pointes',
       'classique 2', 'classique 1', 'classique interm. – avancé'],
      dtype=object)

In [34]:
for row in range(len(attitude.registration)):
    attitude.registration.iloc[row] = '30€'

In [35]:
for row in range(len(attitude.installments)):
    attitude.installments.iloc[row] = attitude.installments.iloc[row].replace('10 chéques\n(1-10)', '10').replace('9 + 1', '10').replace("lement\ntoulouse, 31000\nE-mail juliereglat@gmail.com\nTéléphone (06) 760-48948\nReprésentant légal de l’inscrit (pour  sécail-Réglat Julie\nles mineurs)\nCours:Initiation\xa0\xa0\xa0\xa0\xa0\xa0\nCours\nHoraire:\xa0\xa0\xa0\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\n\xa0\xa0\xa0\xa0\xa0\xa0\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\nJeudi \xa0\xa0\xa017h15 - 18h15\xa0\xa0\xa0\xa0\xa0\nJeudi Heure\n\xa0\xa0\n\xa0\xa0\xa0\xa0\xa0\xa0\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\nTéléverser le Certi\x00cat Médical\nLouise.pdf\npdf\n1\nTéléverser le Certi\x00cat d’assurance \nextra-scolaire ou assurance civil\nMacif_Attestation Responsabilité Civile.pdf\npdf\nLe paiement du cours sera effectué avec\xa03 \xa0chèques\n(1-10)\nde196+147+147  €, au total de\xa0\xa0\xa0 500€ (1 cours) \xa0\xa0+\n(valeur de chaque chèque) Please Select\n30€ adhésion .\nMoins 10€\xa0pour le paiement comptant ou trimestre.\n\xa0Découvrez notre\xa0planning et nos tarifs\xa0a\nattitudecorpsetdanses.com/tarifs-et-planning\nPour l’abonnement annuel à Attitude \nJe suis d'accord\nCorps et Danses de la saison \n2021/2022 je ne pourrai en aucun cas \nfaire opposition à mes chèques ( voir \narticle L131-35 du code monétaire et \n\x00nancier) ou en demander la \nrestitution en cas d’arrêt de ma part.\nJ'autorise l'autorisation de droit à \nOui\nl'image et/ou à la voix pour la \npromotion de l'Attitude Corps et \nDanses.\nJe reconnais avoir pris connaissance \nOui\ndu règlement intérieur *, des \nconditions générales d’inscriptions* \nde l’Association Attitude Corps et \nDanses, d’avoir présenté un certi\x00cat \nmédical de non-contre indication à la \npratique de la danse et d’avoir \nprésenté un certi\x00cat d’assurance \nextra-scolaire ou assurance civil.* \n(*Règlement intérieur/ conditions \ngénérales disponibles sur: \nhttps://attitudecorpsetdanses.com/re\nglement-interieur/*)\nSignature\n2\nSunday, July 11, 2021", '3')
    
attitude.installments.value_counts()

3      39
1      34
10     14
4       5
2       3
5       2
0       1
shh     1
Name: installments, dtype: int64

In [36]:
cols = ["Nom et prénom d'élève", "Date de naissance", "Adresse", "Cité", "Code Postale", "E-mail", "Téléphone", "Représentant légal de l’inscrit (pour les mineurs)", "Cours", "Horaire",  "Cours 2", "Horaire 2",  "Cours 3", "Horaire 3", "Adhésion", "Paiement fractionné", "Paiement Total"]
attitude.columns = cols

In [274]:
for row in range(len(attitude['Paiement Total'])):
    if attitude['Paiement Total'].iloc[row] == '+ 30':
        if attitude['Cours'].iloc[row] == 'carte 10 cours':
            attitude['Paiement Total'].iloc[row] = 1000
        elif attitude['Cours 2'].iloc[row] == 0:
            attitude['Paiement Total'].iloc[row] = 500
        else:
            if attitude['Cours 3'].iloc[row] == 0:
                attitude['Paiement Total'].iloc[row] = 720
            else:
                attitude['Paiement Total'].iloc[row] = 880
attitude['Paiement Total'].iloc[46] = 1040
attitude['Paiement Total'].iloc[53] = 720
attitude['Paiement Total'].iloc[54] = 500
attitude['Paiement Total'].iloc[58] = 1040
for row in range(len(attitude['Paiement Total'])):
    attitude['Paiement Total'].iloc[row] = int(attitude['Paiement Total'].iloc[row])
print(attitude['Paiement Total'].unique())

[720 500 880 1000 1040]


In [293]:
attitude_eleves = attitude.drop(columns=["Cours", "Horaire",  "Cours 2", "Horaire 2",  "Cours 3", "Horaire 3", "Adhésion", "Paiement fractionné", "Paiement Total"])
attitude_cours = attitude.drop(columns=[ "Date de naissance", "Adresse", "Cité", "Code Postale", "E-mail", "Téléphone", "Représentant légal de l’inscrit (pour les mineurs)", "Adhésion", "Paiement fractionné", "Paiement Total"])
attitude_paiement = attitude.drop(columns=["E-mail", "Date de naissance", "Adresse", "Cité", "Code Postale","E-mail", "Téléphone", "Représentant légal de l’inscrit (pour les mineurs)", "Cours", "Horaire",  "Cours 2", "Horaire 2",  "Cours 3", "Horaire 3"])

In [294]:
courses = []
for cours in attitude_cours['Cours'].unique():
    if cours != 0 and cours != '1' and cours != "":
        courses.append(cours)
for cours2 in attitude_cours['Cours 2'].unique():
    if cours2 != 0 and cours2 != '1' and cours != "":
        courses.append(cours2)
for cours3 in attitude_cours['Cours 3'].unique():
    if cours3 != 0 and cours3 != '1' and cours != "":
        courses.append(cours3)

courses = set(courses)
courses

{'',
 'barre à terre',
 'carte 10 cours',
 'classique 1',
 'classique 2',
 'classique avancé',
 'classique interm. – avancé',
 'classique moyen',
 'contemporain',
 'initiation',
 'moderne',
 'pbt',
 'pbt + ballet fitness',
 'pilates',
 'pointes',
 'préparatoire',
 'éveil'}

In [295]:
rows =[]
def students (course):
    cours =[]
    for row in range(len(attitude_cours['Cours'])):
        if attitude_cours['Cours'].iloc[row] == course or attitude_cours['Cours 2'].iloc[row] == course or attitude_cours['Cours 3'].iloc[row] == course:
            cours.append(attitude_cours["Nom et prénom d'élève"].iloc[row])
    return cours


Unnamed: 0,Nom et prénom d'élève,Date de naissance,Adresse,Cité,Code Postale,E-mail,Téléphone,Représentant légal de l’inscrit (pour les mineurs),Cours,Horaire,Cours 2,Horaire 2,Cours 3,Horaire 3,Adhésion,Paiement fractionné,Paiement Total
0,Blatt Luce,19-10-1960,"Rue Des Vincennes, 9",Toulouse,31500,marieluceblatt@gmail.com,+33 607103468,0,barre à terre,lundi 12h15,pbt + ballet fitness,jeudi 9h30,0,0,30€,4,720
1,Paulon Lily,24-07-2014,11 Rue Du Docteur Charles Bonneau,Toulouse,31000,do_julia@hotmail.com,+33 648949998,Paulon Julia,préparatoire,lundi 17h,0,0,0,0,30€,4,500
2,Mezard Emmanuelle,03-02-1970,"Rue Sarah Bernhardt, 44",Toulouse,31200,emmanuelle.mezard@free.fr,6 03464351,0,classique moyen,mardi 19h,pbt + ballet fitness,jeudi 9h30,classique moyen,vendredi 19h,30€,3,880
3,Corbiere Beatrix,20-09-1960,"Jean Gayral, 83",Toulouse,31200,lacabiche@free.fr,+33 0663253652,0,pilates,lundi 20h30,,19h,,0,30€,1,720
4,François Eve,11-08-1948,"Avenue Winston Churchill, 10",Toulouse,31100,francoiseve9@gmail.com,+33 761113634,0,barre à terre,lundi 12h15,classique moyen,vendredi 19h,barre à terre,0,30€,3,720
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,Peccia-Galletto Sasha,17-06-2016,"Monplaisir, 40",Toulouse,31400,ln.seguela@gmail.com,06 24971201,Séguéla Hélène,éveil,mardi 17h,0,0,0,0,30€,3,500
95,Galvani Francoise,04-09-1959,6 Rue Pierre De Fermat,Toulouse,31000,francoisegalvani@icloud.com,+33 609380633,0,barre à terre,mardi 9h,0,0,0,0,30€,1,500
96,Lemozit Sasha,05-10-2007,"Alleee Des Demoiselles, 6",Toulouse,31400,christophelemozit@yahoo.fr,+33 682463223,Lemozit Christophe,moderne,vendredi 18h,0,0,0,0,30€,3,500
97,Blanc Bouny Celeste,13-07-2013,"Rue Galilee, 9",Toulouse,31500,clairebouny0412@yahoo.fr,06 81170151,Bouny Claire,préparatoire,lundi 17h,0,0,0,0,30€,3,500


In [298]:
data_courses = dict()
for course in courses:
    try:
            data_courses[course]= students(course)
            print(course)
    except:
        continue
        
data_courses.values()


carte 10 cours
classique 1
pointes
classique interm. – avancé
éveil
classique 2
pbt
préparatoire
moderne
pilates
classique moyen
classique avancé
contemporain
barre à terre
pbt + ballet fitness
initiation


dict_values([['Corbiere Beatrix', 'Bernies-Abelanet Brigitte', 'Guez Anne Valerie', 'Virginie Casas/De Bienassis', 'Pyronnet-Masterson Christine', 'Vandenplas-Etchepare Géraldine', 'Vella-Lafage Marjorie', 'Jeandel Josephine', 'Bonnet Hortense', 'Prévôt Caroline', 'Belhon Josiane', 'Bonafé Marie', 'Galissier Anne-Lise'], ['Shawali Natacha'], ['Gaiddon Emma', 'Marty Desbordes Valentina', 'Sahal Dominique', 'Mahdi Lucie', 'Cassagnes Béatrice', 'Goulet-Thoumazet Justine', 'Tozge Marie', 'Ané Hector', 'Ané Erell', 'Izard Elodie', 'Placet Noémie', 'Soulie Louise', 'Waeghemaeker Alix', 'De Volontat Jeanne'], ['Virginie Casas/De Bienassis', 'Latge Nicole', 'Clamen Laborie Tessa', 'Bonnevialle Margaux', 'Vergé-Brian Mathilde', 'Ducos Emma', 'Ottavioli Justine', 'Moreaux Nancy', 'Bonnet Hortense', 'Fouilleron Heloïse', 'Van Rothem Juliette', 'De Volontat Jeanne'], ['Vandenplas-Etchepare Géraldine', 'Bonnevialle Margaux', 'Cassagnes Béatrice', 'Safa Helene', 'Moreaux Nancy', 'Tozge Marie', 'Herv

In [299]:
for course in data_courses.keys():
    if course != 0 or course !='carte 10 cours':
        data_courses[course].append('Shawali Natacha')

In [300]:
colunes=[]
for row in range(len(attitude_cours['Cours'])):
    rows=[]
    rows.append(attitude_cours["Nom et prénom d'élève"].iloc[row])
    for course in data_courses.keys():
        if attitude_cours['Cours'].iloc[row] == course or attitude_cours['Cours 2'].iloc[row] == course or attitude_cours['Cours 3'].iloc[row] == course:
            rows.append(1)
        else:
            rows.append(0)
    colunes.append(rows)
            
classes = pd.DataFrame(colunes)
classes.columns = ('nom','', 'carte 10 cours','classique 1','pointes','classique interm. – avancé','éveil','classique 2','pbt','préparatoire','moderne','pilates','classique moyen','classique avancé','contemporain','barre à terre','pbt + ballet fitness','initiation')
classes = classes.drop(['','carte 10 cours'], axis=1)

['Ferreira Constance', 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [302]:
attitude_eleves.to_csv('elevesdf.csv')
attitude_cours.to_csv('coursdf.csv')
classes.to_csv('classes.csv')
attitude_paiement.to_csv('paiment.csv')

In [303]:
def open_ssh_tunnel_and_mysql ():
    """Open an SSH tunnel and connect to a MySQL server using the SSH tunnel connection
    return Global MySQL connection"""
    
    try:    
        db_server= '127.0.0.1'
        user='tete'
        db_port = '3306'
        password = 'frida2202'
        ip = 'localhost'
        db_name = 'attitude'
        global server
        global connection
        global conn_addr            
        server = SSHTunnelForwarder(('138.197.99.33', 4242), ssh_username="tete", ssh_password="frida", remote_bind_address=('127.0.0.1', 3306))
        server.start()
        print('Tunnel oppend :-P')
        port = str(server.local_bind_port)
        conn_addr = 'mysql://' + user + ':' + password + '@' + db_server + ':' + port + '/' + db_name
        engine = create_engine(conn_addr)
        connection = engine.connect()
        print('Yeah! MySQL server connected using the SSH tunnel connection!')
    except:
        try:
            disconnect_mysql ()
        except:
            pass
        try:    
            shut_ssh_tunnel ()
        except:
            pass
        open_ssh_tunnel_and_mysql()
    
def create_table (dataframe, df_name=str):
    dataframe.to_sql(df_name, conn_addr, if_exists='replace', index=False)
    print('All done, Madam!')

def disconnect_mysql ():
    """Disconnect from MySQL server"""
    connection.close()  
    print('MySQL server is not connected anymore!')
    
def shut_ssh_tunnel ():
    """Stop the SSH tunnel"""
    server.stop()
    print("You've stopped the SSH tunnel!")

In [304]:
open_ssh_tunnel_and_mysql ()

create_table (attitude_eleves, 'elevesdf')
create_table (attitude_cours, 'coursdf')
create_table (classes, 'classesdf')
create_table (attitude_paiement, 'paimentsdf')

disconnect_mysql ()
shut_ssh_tunnel ()

Tunnel oppend :-P
Yeah! MySQL server connected using the SSH tunnel connection!
All done, Madam!
All done, Madam!
All done, Madam!
All done, Madam!
MySQL server is not connected anymore!
You've stopped the SSH tunnel!


In [291]:
attitude_eleves[attitude_eleves['Représentant légal de l’inscrit (pour les mineurs)'] != 0]

Unnamed: 0,Nom et prénom d'élève,Date de naissance,Adresse,Cité,Code Postale,E-mail,Téléphone,Représentant légal de l’inscrit (pour les mineurs)
1,Paulon Lily,24-07-2014,11 Rue Du Docteur Charles Bonneau,Toulouse,31000,do_julia@hotmail.com,+33 648949998,Paulon Julia
6,Gaiddon Emma,14-02-2013,88 Avenue Saint Exupery,Toulouse,31400,�orence@gaiddon.com,+33 0687505082,Laparliere Florence
7,Austruy Pandore,25-06-2014,"30 Rue Saint Luc, 30 Rue Saint Luc",Toulouse,31400,juliedischer@hotmail.fr,+33 668596316,Austruy Julien
8,Austruy Simone,12-11-2017,"30 Rue Saint Luc, 30 Rue Saint Luc",Toulouse,31400,juliedischer@hotmail.fr,+33 668596316,Austruy Julien
9,Guez Anne Valerie,19-11-1970,"Rue D’Alsace Lorrainr, 44",Toulouse,31000,gobati@hotmail.fr,6 08952343,Guez Anne Valerie
17,Lewin Fleur Rose,29-08-2013,34 Rue Monplaisir,Toulouse,31400,annelewin�eur@yahoo.fr,+33 611778334,Lewin Fleur Anne
19,Clamen Laborie Tessa,23-07-2011,"Rue Denis Diderot , 12 Bis",Toulouse,31400,sabineclamen@free.fr,+33 0698610102,Clamen Sabine
20,Bonnevialle Margaux,11-03-2007,"113 Avenue De Muret, 113 Avenue De Muret",Toulouse,31300,�obonnevialle@yahoo.fr,+33 629513691,Bonnevialle Florence
21,Pichon Venturini Apolline,13-12-2015,"Rue De La Gaieté , 28",Toulouse,31400,jessica.venturini@laposte.net,+33 0685304987,Venturini Jessica
22,De Roquette Buisson Maya,14-04-2016,"Péniche Face N°28 Bd Griffoul Dorval, Péniche ...",Griffoul Dorval,0,pauline.desars@act-avocats.com,6 20631980,De Roquette Buisson Pauline
