<a href="https://colab.research.google.com/github/emmanuelvaie/google_colab/blob/main/BigQueryAnalysis_profiles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# @title Setup
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

# Authenticate first so the BigQuery client created below picks up the user's credentials
auth.authenticate_user()

project = 'fluent-music-364313' # Project ID inserted based on the query results selected to explore
location = 'europe-west9' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()

In [2]:
query = """
SELECT * FROM `fluent-music-364313.import_boond.profiles` WHERE DATE(_PARTITIONTIME) = "2022-11-17"; 
"""
job = client.query(query)

## Reference SQL syntax from the original job
Use the ```jobs.query```
[method](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) to
return the SQL syntax from the job. This can be copied from the output cell
below to edit the query now or in the future. Alternatively, you can use
[this link](https://console.cloud.google.com/bigquery?j=fluent-music-364313:europe-west9:bquxjob_681ca93c_184a4051b54)
back to BigQuery to edit the query within the BigQuery user interface.

In [3]:
print(job.query)


SELECT * FROM `fluent-music-364313.import_boond.profiles` WHERE DATE(_PARTITIONTIME) = "2022-11-17"; 



# Result set loaded from BigQuery job as a DataFrame
Query results are read from the ID of the job that was run in BigQuery, so the
query does not need to be re-run to explore the results. The ```to_dataframe```
[method](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
downloads the results to a pandas DataFrame by using the BigQuery Storage API.

To edit the query syntax, use the BigQuery SQL editor or the
```Optional:``` sections below.
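
If the `job` object is no longer in memory, for example after a runtime restart, the results can usually be reached again through the job ID rather than by re-running the SQL. The sketch below assumes an authenticated `bigquery.Client` and relies on its `get_job` method; the helper name `load_job_results` is hypothetical, and the sketch assumes the job's results are still retained by BigQuery.

```python
def load_job_results(client, job_id, location):
    """Fetch an existing query job by ID and download its results.

    `client` is assumed to be an authenticated google.cloud.bigquery.Client.
    """
    # Look up the job that already ran; this does not execute the SQL again
    job = client.get_job(job_id, location=location)
    # Download the stored result set as a pandas DataFrame
    return job.to_dataframe()
```

For this notebook the call would look like `load_job_results(client, 'bquxjob_681ca93c_184a4051b54', location)`, with the job ID taken from the BigQuery console link above.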

In [32]:
# Running this code will read results from your previous job
results = job.to_dataframe()
results.head()



Unnamed: 0,Civilit__,Pr__nom,Nom,Type,Etape,Titre,Comp__tences,Evaluation_Globale,Date_de_Naissance,Nationalit__,...,CV1,CV2,CV3,CV4,CV5,Mobilit__,Langues,Domaines,Secteurs,Outils
0,M,Satyam,A,Consultant Externe,Vivier sourcing,Architect & Tech Lead Microsoft Azure,,Inconnue,,Inconnue,...,,,,,,Inconnue,,,,
1,M,Hicham,Ouali Alami,Consultant Externe,Vivier sourcing,Data Governance,,Inconnue,,Inconnue,...,,,,,,Inconnue,,,,
2,M,Hatem,Nasri,Consultant Externe,Vivier sourcing,Big Data eng,,Inconnue,,Inconnue,...,,,,,,Inconnue,,,,
3,M,Soukaina,Rommane,Consultant Externe,Vivier sourcing,Python,,Inconnue,,Inconnue,...,,,,,,Inconnue,,,,
4,M,Youcef,BOUNEKTA,Consultant Externe,Vivier sourcing,Devops,,Inconnue,,Inconnue,...,,,,,,Inconnue,,,,


## Show descriptive statistics using describe()
Use the ```pandas DataFrame.describe()```
[method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
to generate descriptive statistics. Descriptive statistics include those that
summarize the central tendency, dispersion and shape of a dataset’s
distribution, excluding ```NaN``` values. You may also use other Python methods
to interact with your data.

In [33]:
results.describe(include='all')



Unnamed: 0,Civilit__,Pr__nom,Nom,Type,Etape,Titre,Comp__tences,Evaluation_Globale,Date_de_Naissance,Nationalit__,...,CV1,CV2,CV3,CV4,CV5,Mobilit__,Langues,Domaines,Secteurs,Outils
count,959,959,884,959,959,926,610,959,66,959,...,804,189,62,21,118,953,147,607,0.0,0.0
unique,1,538,739,2,9,506,530,5,53,3,...,689,184,49,19,102,22,17,19,0.0,0.0
top,M,Mohamed,Ahmed,Consultant Externe,Vivier,Devops,agile,Inconnue,19/08/1990,Inconnue,...,Sami Diramchi.pdf,https://sonatecloud-my.sharepoint.com/:w:/g/pe...,"CodinGame - C#,_.NET_platform_-_Expert - MALIK...",Grille d'évaluation Badr KACIMI.docx,Sonate_Hamza M.docx,Inconnue,", Anglais : Inconnue","RD,Devops,Cloud,Data",,
freq,959,34,4,926,344,25,2,601,2,949,...,2,2,2,2,3,887,71,269,,
mean,,,,,,,,,,,...,,,,,,,,,,
std,,,,,,,,,,,...,,,,,,,,,,
min,,,,,,,,,,,...,,,,,,,,,,
25%,,,,,,,,,,,...,,,,,,,,,,
50%,,,,,,,,,,,...,,,,,,,,,,
75%,,,,,,,,,,,...,,,,,,,,,,
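
With `include='all'`, `describe()` reports `count`, `unique`, `top` and `freq` for text columns and leaves the numeric rows (`mean`, `std`, quantiles) as `NaN` for them, which is why those rows are empty in the output above. The sketch below shows the same behaviour on a small toy frame; the data and the `score` column are made up for illustration and do not come from the BigQuery table.

```python
import pandas as pd

# Toy data (hypothetical): one text column and one numeric column, each with a missing value
toy = pd.DataFrame({
    "Type": ["Consultant Externe", "Consultant Externe", "Consultant Interne", None],
    "score": [1.0, 2.0, 3.0, None],
})

summary = toy.describe(include="all")

# Text column: count/unique/top/freq are filled in, numeric statistics are NaN
print(summary.loc[["count", "unique", "top", "freq", "mean"], "Type"])

# Numeric column: mean and min/max are filled in, unique/top/freq are NaN
print(summary.loc[["count", "unique", "mean", "min", "max"], "score"])
```

Missing values are excluded from `count` in both columns, so each reports 3 rather than 4.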


In [34]:
cols = results.columns

In [35]:
import pandas as pd
prf = pd.DataFrame()

In [40]:
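for c in cols:
  # Number of missing values in column c
  nb_null = results[c].isna().sum()
  # Frequency of each distinct value, most common first
  freq = results[c].value_counts()
  # One-row frame for this column; the whole value_counts Series is stored in the 'freq' cell
  d = pd.DataFrame(data={'nom_col': [c], 'nb_null': [nb_null], 'freq': [freq]})
  prf = pd.concat([prf, d])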


In [41]:
prf.head()

Unnamed: 0,nom_col,nb_null,freq
0,Civilit__,0,"M 959 Name: Civilit__, dtype: int64"
0,Pr__nom,0,Mohamed 34 Youssef 17 Ahmed ...
0,Nom,75,Ahmed 4 Sassi 3 BOUBAKER 3 DIAL...
0,Type,0,Consultant Externe 926 Consultant Interne ...
0,Etape,0,Vivier 344 No Go Sonat...


In [42]:
nb_records = len(results)
nb_records

959

In [43]:
# Percentage of missing values per column (vectorized; same values as the apply/lambda form)
prf['pct_null'] = 100 * prf['nb_null'] / nb_records

In [44]:
prf

Unnamed: 0,nom_col,nb_null,freq,pct_null
0,Civilit__,0,"M 959 Name: Civilit__, dtype: int64",0.0
0,Pr__nom,0,Mohamed 34 Youssef 17 Ahmed ...,0.0
0,Nom,75,Ahmed 4 Sassi 3 BOUBAKER 3 DIAL...,7.820647
0,Type,0,Consultant Externe 926 Consultant Interne ...,0.0
0,Etape,0,Vivier 344 No Go Sonat...,0.0
0,Titre,33,Devops 25 .Net ...,3.441084
0,Comp__tences,349,agile ...,36.392075
0,Evaluation_Globale,0,Inconnue 601 D 273 C 3...,0.0
0,Date_de_Naissance,893,19/08/1990 2 8. 11/2000 2 ...,93.117831
0,Nationalit__,0,Inconnue 949 Non Français 7 França...,0.0
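
The null counts and percentages above can also be computed without a per-column loop. The following is a minimal sketch of that alternative, assuming only the null statistics are needed and not the `freq` column; the helper name `null_profile` and the toy data are hypothetical.

```python
import pandas as pd

def null_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its null count and null percentage."""
    is_null = df.isna()
    nb_null = is_null.sum()
    return pd.DataFrame({
        "nom_col": nb_null.index,
        "nb_null": nb_null.to_numpy(),
        # The mean of a boolean column is the fraction of True values
        "pct_null": 100 * is_null.mean().to_numpy(),
    })

# Toy data (hypothetical), standing in for `results`
toy = pd.DataFrame({
    "Nom": ["Ahmed", None, "Sassi", None],
    "Type": ["a", "b", "c", "d"],
})
profile = null_profile(toy)
print(profile)
```

Calling `null_profile(results)` would give a frame with a regular 0..n-1 index, unlike `prf` above, where every row keeps index 0 from the repeated `pd.concat`.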
