<a href="https://colab.research.google.com/github/emmanuelvaie/google_colab/blob/main/BigQueryAnalysis_profiles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
# @title Setup
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

# Authenticate the Colab user before creating the BigQuery client
auth.authenticate_user()

project = 'fluent-music-364313' # Project ID inserted based on the query results selected to explore
location = 'europe-west9' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()

In [8]:
query = """
SELECT * FROM import_boond.sonate_profiles where Ingestion_date='2022_12_05 14:17:44'; 
"""
job = client.query(query)

## Reference SQL syntax from the original job
Use the ```jobs.query```
[method](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) to
return the SQL text of the job. You can copy it from the output cell
below to edit the query now or later. Alternatively, follow
[this link](https://console.cloud.google.com/bigquery?j=fluent-music-364313:europe-west9:bquxjob_681ca93c_184a4051b54)
back to BigQuery to edit the query in the BigQuery user interface.

In [9]:
print(job.query)


SELECT * FROM import_boond.sonate_profiles where Ingestion_date='2022_12_05 14:17:44'; 



## Result set loaded from BigQuery job as a DataFrame
Query results are retrieved through the ID of the job that ran in BigQuery, so
the query does not need to be re-run to explore them. The ```to_dataframe```
[method](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
downloads the results to a Pandas DataFrame by using the BigQuery Storage API.

To edit the query, use the BigQuery SQL editor or the
```Optional:``` sections below.

In [10]:
# Running this code will read results from your previous job
results = job.to_dataframe()
results.head()



Unnamed: 0,Civilit__,Pr__nom,Nom,Type,Etape,Titre,Comp__tences,Evaluation_Globale,Date_de_Naissance,Nationalit__,...,CV2,CV3,CV4,CV5,Mobilit__,Langues,Domaines,Secteurs,Outils,Ingestion_date
0,M,Slim,SOUMRI,Consultant Externe,Vivier sourcing,freelancer java,"html, npm, css, typescript, sql, apache, oracl...",Inconnue,,Inconnue,...,,,,,Inconnue,,"RD, Cloud",,"html : 3, npm : 3, css : 3, typescript : 3, sq...",2022_12_05 14:17:44
1,M,Ahcene,OUKACI,Consultant Externe,Vivier sourcing,Développeur .NET fullstack,"html, bootstrap, css, typescript, sql, oracle,...",Inconnue,,Inconnue,...,,,,,Inconnue,,RD,,"html : 3, bootstrap : 3, css : 3, typescript :...",2022_12_05 14:17:44
2,M,Siddharth,PANDEY,Consultant Externe,Vivier sourcing,"Data Engineering, Cloud And DevOps",,Inconnue,,Inconnue,...,,,,,Inconnue,,,,,2022_12_05 14:17:44
3,M,Amine,El Harrak,Consultant Interne,Vivier sourcing,Dev FS Junior JAVA ANGULAR CDI,"digital marketing, react, java, angular, master",Inconnue,,Inconnue,...,,,,,Inconnue,,RD,,"digital marketing : 3, react : 3, java : 3, an...",2022_12_05 14:17:44
4,M,Ahmed,Ouiriemmi,Consultant Interne,Vivier sourcing,Indep ouvert au CDI .net/C#,"agile, html, css, sql, oracle, sql server, jav...",Inconnue,,Inconnue,...,,,,,Inconnue,,RD,,"agile : 3, html : 3, css : 3, sql : 3, oracle ...",2022_12_05 14:17:44


## Show descriptive statistics using describe()
Use the ```pandas DataFrame.describe()```
[method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
to generate descriptive statistics. Descriptive statistics include those that
summarize the central tendency, dispersion and shape of a dataset’s
distribution, excluding ```NaN``` values. You may also use other Python methods
to interact with your data.

In [11]:
results.describe(include='all')



Unnamed: 0,Civilit__,Pr__nom,Nom,Type,Etape,Titre,Comp__tences,Evaluation_Globale,Date_de_Naissance,Nationalit__,...,CV2,CV3,CV4,CV5,Mobilit__,Langues,Domaines,Secteurs,Outils,Ingestion_date
count,978,978,902,978,978,945,881,978,79,978,...,186,38,9,3,972,157,775,34,881,978
unique,1,548,756,2,9,515,788,5,64,3,...,157,35,9,3,24,17,20,20,798,1
top,M,Mohamed,Ahmed,Consultant Externe,Vivier,Devops,go,Inconnue,04 juin 2005,Inconnue,...,Sonate_Hamza M.docx,"CodinGame - Docker,_Kubernetes,_Python_3_-_Sen...",CodinGame - Développeur_Python_-_Senior - Habi...,CodinGame - Développeur_Full_Stack_Django_-_Se...,Inconnue,", Anglais : Inconnue","RD, Data",banque assurance,angular : 3,2022_12_05 14:17:44
freq,978,34,4,945,345,25,6,619,2,968,...,3,2,1,1,899,78,196,5,5,978
mean,,,,,,,,,,,...,,,,,,,,,,
std,,,,,,,,,,,...,,,,,,,,,,
min,,,,,,,,,,,...,,,,,,,,,,
25%,,,,,,,,,,,...,,,,,,,,,,
50%,,,,,,,,,,,...,,,,,,,,,,
75%,,,,,,,,,,,...,,,,,,,,,,
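
The ```mean```, ```std``` and quantile rows in the output above are empty, presumably because the columns shown hold text: ```describe(include='all')``` fills the numeric statistics only for numeric columns and the ```unique```/```top```/```freq``` rows only for non-numeric ones. A minimal sketch on a toy DataFrame illustrates this; the ```score``` column and its values are hypothetical and not part of the query results.

```python
import pandas as pd

# Toy data: one text column (values taken from the Type column above)
# and one hypothetical numeric column added only for illustration.
toy = pd.DataFrame({
    "Type": ["Consultant Externe", "Consultant Externe", "Consultant Interne"],
    "score": [3, 1, 2],
})

# include="all" mixes both kinds of statistics in one table;
# cells that do not apply to a column's dtype are left as NaN.
summary = toy.describe(include="all")
print(summary)
```

In ```summary```, the ```Type``` column has values for ```count```, ```unique```, ```top``` and ```freq``` but ```NaN``` for ```mean```, while ```score``` shows the opposite pattern.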


In [12]:
cols = results.columns

In [13]:
import pandas as pd
prf = pd.DataFrame()

In [14]:
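frames = []
for c in cols:
  nb_null = results[c].isna().sum()  # number of missing values in the column
  freq = results[c].value_counts()   # occurrences of each distinct value
  frames.append(pd.DataFrame(data = {'nom_col': c, 'nb_null': [nb_null], 'freq': [freq]}))
# Concatenate once at the end; DataFrame.append was removed in pandas 2.0
prf = pd.concat(frames)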


In [15]:
prf.head()

Unnamed: 0,nom_col,nb_null,freq
0,Civilit__,0,"M 978 Name: Civilit__, dtype: int64"
0,Pr__nom,0,Mohamed 34 Youssef 18 Hamza 14 A...
0,Nom,76,Ahmed 4 MOHAMED ...
0,Type,0,Consultant Externe 945 Consultant Interne ...
0,Etape,0,Vivier 345 No Go Sonat...
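
The ```freq``` column above stores a whole ```value_counts()``` Series in each cell, which is why the display is truncated. One possible alternative, sketched here under the assumption that only the most frequent values matter, is to keep a small dictionary per column. The helper name ```top_values```, its parameter ```k``` and the value ```"Autre"``` are hypothetical.

```python
import pandas as pd

def top_values(series, k=3):
    """Return the k most frequent values of a Series as {value: count}."""
    # value_counts() sorts by descending count, so head(k) keeps the top k.
    return series.value_counts().head(k).to_dict()

# Toy data; "Vivier" and "Vivier sourcing" appear in the Etape column above,
# "Autre" is a made-up value for illustration.
etape = pd.Series(
    ["Vivier", "Vivier", "Vivier", "Vivier sourcing", "Vivier sourcing", "Autre"]
)
top2 = top_values(etape, k=2)
print(top2)
```

A dictionary cell stays readable in the DataFrame display, at the cost of dropping the tail of the distribution.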


In [16]:
nb_records = len(results)
nb_records

978

In [17]:
# Share of missing values per column, in percent
prf['pct_null'] = 100 * prf['nb_null'] / nb_records

In [18]:
prf

Unnamed: 0,nom_col,nb_null,freq,pct_null
0,Civilit__,0,"M 978 Name: Civilit__, dtype: int64",0.0
0,Pr__nom,0,Mohamed 34 Youssef 18 Hamza 14 A...,0.0
0,Nom,76,Ahmed 4 MOHAMED ...,7.770961
0,Type,0,Consultant Externe 945 Consultant Interne ...,0.0
0,Etape,0,Vivier 345 No Go Sonat...,0.0
0,Titre,33,Devops 25 Data E...,3.374233
0,Comp__tences,97,go ...,9.9182
0,Evaluation_Globale,0,Inconnue 619 D 273 C 3...,0.0
0,Date_de_Naissance,899,04 juin 2005 2 31/07/1989 2 04/10/1981...,91.92229
0,Nationalit__,0,Inconnue 968 Non Français 7 França...,0.0
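
The null counts and percentages above can also be computed without an explicit loop, because ```DataFrame.isna()``` works on all columns at once. The sketch below is one way to do it on a toy DataFrame; the function name ```profile_nulls``` is hypothetical, and the output keeps the column names ```nom_col```, ```nb_null``` and ```pct_null``` used in this notebook.

```python
import pandas as pd

def profile_nulls(df):
    """Build one row per column with its null count and null percentage."""
    is_null = df.isna()
    return pd.DataFrame({
        "nom_col": list(df.columns),
        "nb_null": is_null.sum().to_numpy(),             # nulls per column
        "pct_null": (100 * is_null.mean()).to_numpy(),   # mean of booleans = null share
    })

# Toy data for illustration only; names are taken from the sample rows above.
toy = pd.DataFrame({
    "Nom": ["SOUMRI", None, "PANDEY", None],
    "Type": ["Consultant Externe"] * 4,
})
profile = profile_nulls(toy)
print(profile)
```

Unlike the loop, this version yields a regular ```0..n-1``` index rather than an index of repeated zeros.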
