<a href="https://colab.research.google.com/github/emmanuelvaie/google_colab/blob/main/EDA_of_export_boond_database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# @title Setup
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'fluent-music-364313' # Project ID inserted based on the query results selected to explore
location = 'europe-west9' # Location inserted based on the query results selected to explore
auth.authenticate_user() # Authenticate first so credentials are in place before the client is created
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()

In [2]:
query = """
SELECT * FROM  `fluent-music-364313`.export_boond.candidates; 
"""
job = client.query(query)

## Reference SQL syntax from the original job
Use the ```jobs.query```
[method](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) to
return the SQL syntax from the job. This can be copied from the output cell
below to edit the query now or in the future. Alternatively, you can use
[this link](https://console.cloud.google.com/bigquery?j=fluent-music-364313:europe-west9:bquxjob_681ca93c_184a4051b54)
back to BigQuery to edit the query within the BigQuery user interface.

In [3]:
print(job.query)


SELECT * FROM  `fluent-music-364313`.export_boond.candidates; 



# Result set loaded from BigQuery job as a DataFrame
Query results are read from the BigQuery job by its job ID, so the query does
not need to be re-run to explore them. The ```to_dataframe```
[method](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
downloads the results to a pandas DataFrame by using the BigQuery Storage API.

To edit the query syntax, use the BigQuery SQL editor or the
```Optional:``` sections below.

In [4]:
# Running this code will read results from your previous job
results = job.to_dataframe()
results.head()



Unnamed: 0,Reference_interne,Civilite,Nom,Prenom,Type,Titre,Competences,Domaines_d_application,Formation,Experience,...,Date_de_mise_a_jour,Commentaires__informations_,Langues,Diplomes,Secteurs,Ressource___Reference_interne,Nombre_de_CV,Etape___motif,Motif___Commentaire,version
0,CAND2635,M.,Mirzaian,Vincent,Consultant Externe,EXPERT IAM,,,,,...,2023-02-24 10:32:00+00:00,,,2015 - ingénieurie informatique - EPITECH,,,1,,,2023-02-27 16:47:38+00:00
1,CAND2639,M.,Ben Mariem,Abderrahmen,Non défini,"Architecte, Trainer et Expert JAVA/JEE, Angula...","Java, XML, jQuery",,,,...,2023-02-27 15:44:00+00:00,,,"ENSI - Diplôme d'ingénieur, informatique - 200...",,,1,,,2023-02-27 16:47:38+00:00
2,CAND2640,M.,SCOFFIER,Antoine,Non défini,DevOps Engineer | Freelance,"Jenkins, Python, Software craftsmanship",,,,...,2023-02-27 15:46:00+00:00,,,"ETNA, école d'alternance en informatique - Mas...",,,1,,,2023-02-27 16:47:38+00:00
3,CAND2637,M.,ALLOUKH,Nabil,Consultant Externe,Développeur .Net senior,"angular, react, c#, f#, python, javascript, ty...",,,,...,2023-02-24 16:12:00+00:00,,Anglais (),2008 - cycle ingénierie en informatique - I.M....,,,1,,,2023-02-27 16:47:38+00:00
4,CAND2612,M.,THIERRY,Alexis,Non défini,Architect,,,,,...,2023-02-23 18:28:00+00:00,,,,,,0,,,2023-02-27 16:47:38+00:00
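
If the `job` object from the earlier cell is no longer in memory (for instance after a runtime restart), the results can in principle be fetched again by job ID instead of re-running the query. The sketch below assumes the job ID that appears in the console link above (`bquxjob_681ca93c_184a4051b54`) and that the job's results are still available on the BigQuery side. `load_job_results` is a hypothetical helper name; `Client.get_job` and `QueryJob.to_dataframe` are the library calls it relies on.

```python
def load_job_results(client, job_id, location):
    """Fetch an existing query job by ID and return its rows as a DataFrame.

    `client` is expected to be a google.cloud.bigquery.Client.
    The helper name is hypothetical, not part of the original notebook.
    """
    job = client.get_job(job_id, location=location)  # look the job up, no re-run
    return job.to_dataframe()


# Usage in the notebook (needs an authenticated client, so it is left commented out):
# results = load_job_results(client, 'bquxjob_681ca93c_184a4051b54', location)
```

Passing the client in as an argument keeps the helper independent of the notebook's global state.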


## Show descriptive statistics using describe()
Use the ```pandas DataFrame.describe()```
[method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
to generate descriptive statistics. Descriptive statistics include those that
summarize the central tendency, dispersion and shape of a dataset’s
distribution, excluding ```NaN``` values. You may also use other Python methods
to interact with your data.

In [5]:
results.describe(include='all')





Unnamed: 0,Reference_interne,Civilite,Nom,Prenom,Type,Titre,Competences,Domaines_d_application,Formation,Experience,...,Date_de_mise_a_jour,Commentaires__informations_,Langues,Diplomes,Secteurs,Ressource___Reference_interne,Nombre_de_CV,Etape___motif,Motif___Commentaire,version
count,5449,5449,5449,5449,5449,5443,5243,4977,5220,5204,...,5449,250,5155,3576,4953,26,5449.0,0.0,0.0,5449
unique,922,2,873,630,3,150,890,21,3,5,...,283,52,26,598,4,7,,,,6
top,CAND2257,M.,A CONFIRMER,mohamed,Consultant Externe,Développeur .Net,J'attends ses dispo,"RD,Data",Master,6 à 10 ans,...,2023-02-01 09:37:00+00:00,Mobile sur IDF\nIl est d'accord pour 2 jours d...,Français (Courant),Master,Autres,COMP2592,,,,2023-02-27 16:47:38+00:00
freq,6,5434,72,162,4425,1140,12,1026,5154,1461,...,1022,6,4656,30,4938,6,,,,922
first,,,,,,,,,,,...,2023-01-11 15:50:00+00:00,,,,,,,,,2023-02-16 16:52:13+00:00
last,,,,,,,,,,,...,2023-02-27 16:33:00+00:00,,,,,,,,,2023-02-27 16:47:38+00:00
mean,,,,,,,,,,,...,,,,,,,1.040374,,,
std,,,,,,,,,,,...,,,,,,,0.615678,,,
min,,,,,,,,,,,...,,,,,,,0.0,,,
25%,,,,,,,,,,,...,,,,,,,1.0,,,
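
To make the layout of the `describe(include='all')` output above easier to read, here is a minimal sketch on a toy frame. The values are made up; only the column names are borrowed from the table. It illustrates that `count` skips missing values and that `unique`/`top`/`freq` are filled only for text columns while `mean` and the quantiles are filled only for numeric ones.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one text column and one numeric column.
toy = pd.DataFrame({
    'Type': ['Consultant Externe', 'Consultant Externe', None, 'Consultant Interne'],
    'Nombre_de_CV': [1, 0, 2, np.nan],
})

summary = toy.describe(include='all')

# count ignores missing values, so each column reports its own non-null count.
print(summary.loc['count'])

# unique/top/freq apply to the text column, mean to the numeric one;
# the cells that do not apply are NaN.
print(summary.loc[['unique', 'top', 'freq', 'mean']])
```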


In [6]:
cols = results.columns

In [7]:
import pandas as pd
prf = pd.DataFrame()

In [8]:
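for c in cols:
  nb_null = results[c].isna().sum()  # number of missing values in the column
  freq = results[c].value_counts()   # count of each distinct non-null value
  # One profile row per column; freq keeps the whole value_counts Series in a single cell
  d = pd.DataFrame(data = {'nom_col': c, 'nb_null': [nb_null], 'freq': [freq]})
  # DataFrame.append was removed in pandas 2.0, so rows are combined with pd.concat
  prf = pd.concat([prf,d])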
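
The per-column null counts built by the loop above can also be computed without a loop. The following is a minimal sketch on a small stand-in frame (made-up rows, with column names borrowed from the export). The `nb_unique` column is an extra that the original profile does not have.

```python
import numpy as np
import pandas as pd

# Stand-in for `results` (made-up rows).
sample = pd.DataFrame({
    'Reference_interne': ['CAND1', 'CAND2', 'CAND3', 'CAND4'],
    'Titre': ['Architect', None, 'DevOps Engineer', None],
    'Nombre_de_CV': [1, 0, np.nan, 1],
})

null_profile = pd.DataFrame({
    'nb_null': sample.isna().sum(),          # missing values per column
    'pct_null': sample.isna().mean() * 100,  # same, as a percentage of rows
    'nb_unique': sample.nunique(),           # distinct non-null values (extra column)
})
null_profile.index.name = 'nom_col'
print(null_profile)
```

Because the result is indexed by column name, a single column can be looked up with `null_profile.loc['Titre']`.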


In [9]:
prf.head()

Unnamed: 0,nom_col,nb_null,freq
0,Reference_interne,0,CAND2257 6 CAND2270 6 CAND2259 6 CAND...
0,Civilite,0,"M. 5434 Mme. 15 Name: Civilite, dtyp..."
0,Nom,0,A CONFIRMER 72 SASSI 24 ZOUAOUI ...
0,Prenom,0,mohamed 162 youssef 90 hamza ...
0,Type,0,Consultant Externe 4425 Consultant Interne ...
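
The `freq` column stores a whole `value_counts()` Series in each cell, which is why the display above is truncated. One possible alternative, sketched here on made-up rows, is a long table with one row per column and value. `top_values` and its parameter `n` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Stand-in for `results` (made-up rows).
sample = pd.DataFrame({
    'Civilite': ['M.', 'M.', 'Mme.', 'M.'],
    'Type': ['Consultant Externe', 'Consultant Interne', 'Consultant Externe', None],
})

def top_values(df, n=3):
    """Return one row per (column, value) pair, keeping the n most frequent values per column."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts().head(n)  # NaN is excluded by default
        for value, count in counts.items():
            rows.append({'nom_col': col, 'value': value, 'count': int(count)})
    return pd.DataFrame(rows)

top = top_values(sample, n=2)
print(top)
```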


In [10]:
nb_records = len(results)
nb_records

5449
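
The row count (5449) is far larger than the number of distinct `Reference_interne` values reported by `describe()` (922), and `version` takes only 6 distinct values. This suggests, but does not prove, that the table holds several export snapshots per candidate. Assuming that reading is right, the sketch below keeps only the most recent `version` of each candidate. The rows are made up for illustration.

```python
import pandas as pd

# Made-up snapshots: CAND1 and CAND2 appear in two exports, CAND3 in one.
snapshots = pd.DataFrame({
    'Reference_interne': ['CAND1', 'CAND1', 'CAND2', 'CAND2', 'CAND3'],
    'Titre': ['Dev', 'Senior Dev', 'Architect', 'Architect', 'DevOps'],
    'version': pd.to_datetime([
        '2023-02-16 16:52:13+00:00', '2023-02-27 16:47:38+00:00',
        '2023-02-16 16:52:13+00:00', '2023-02-27 16:47:38+00:00',
        '2023-02-27 16:47:38+00:00',
    ], utc=True),
})

latest = (
    snapshots.sort_values('version')                            # oldest first
    .drop_duplicates(subset='Reference_interne', keep='last')   # keep the newest row per candidate
    .sort_values('Reference_interne')
    .reset_index(drop=True)
)
print(latest)
```

Profiling `latest` rather than the full table would avoid counting the same candidate once per snapshot.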

In [11]:
prf['pct_null'] = 100 * prf['nb_null'] / nb_records  # share of missing values per column, in percent

In [12]:
prf

Unnamed: 0,nom_col,nb_null,freq,pct_null
0,Reference_interne,0,CAND2257 6 CAND2270 6 CAND2259 6 CAND...,0.0
0,Civilite,0,"M. 5434 Mme. 15 Name: Civilite, dtyp...",0.0
0,Nom,0,A CONFIRMER 72 SASSI 24 ZOUAOUI ...,0.0
0,Prenom,0,mohamed 162 youssef 90 hamza ...,0.0
0,Type,0,Consultant Externe 4425 Consultant Interne ...,0.0
0,Titre,6,Développeur .Net ...,0.110112
0,Competences,206,J'attends ses dispo ...,3.78051
0,Domaines_d_application,472,"RD,Data 1026 Inconnu ...",8.66214
0,Formation,229,Master 5154 Grande école 60 Bac ...,4.202606
0,Experience,245,6 à 10 ans 1461 Pas d'expérience 1...,4.496238
