<a href="https://colab.research.google.com/github/emmanuelvaie/google_colab/blob/main/BigQueryAnalysis_profiles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# @title Setup
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

# Authenticate first so the BigQuery client is created with the user's credentials
auth.authenticate_user()

project = 'fluent-music-364313' # Project ID inserted based on the query results selected to explore
location = 'europe-west9' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()

In [5]:
query = """
select * from import_boond.profiles_25; 
"""
job = client.query(query)

## Reference SQL syntax from the original job
Use the ```jobs.query```
[method](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) to
return the SQL text of the job. You can copy it from the output cell below to
edit the query now or later. Alternatively, follow
[this link](https://console.cloud.google.com/bigquery?j=fluent-music-364313:europe-west9:bquxjob_681ca93c_184a4051b54)
back to BigQuery and edit the query in the BigQuery user interface.

In [6]:
print(job.query)


select * from import_boond.profiles_25; 



## Result set loaded from BigQuery job as a DataFrame
Query results are referenced by the ID of the job that was run in BigQuery, so
the query does not need to be re-run to explore the results. The ```to_dataframe```
[method](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
downloads the results to a Pandas DataFrame by using the BigQuery Storage API.

To edit the query, use the BigQuery SQL editor or the ```Optional:``` sections
below.

In [7]:
# Running this code will read results from your previous job
results = job.to_dataframe()
results.head()



Unnamed: 0,Civilit__,Pr__nom,Nom,Type,Etape,Titre,Comp__tences,Evaluation_Globale,Date_de_Naissance,Nationalit__,...,CV2,CV3,CV4,CV5,Mobilit__,Langues,Domaines,Secteurs,Outils,Ingestion_date
0,M,Marwa,MOHAMED,Consultant Externe,Vivier sourcing,ISMAIl,,Inconnue,,Inconnue,...,,,,,Inconnue,,,,,2022-11-25
1,M,Sabri,Boussetha,Consultant Externe,Vivier sourcing,Cloud/Data Engineer,,Inconnue,,Inconnue,...,,,,,Inconnue,,,,,2022-11-25
2,M,Satyam,A,Consultant Externe,Vivier sourcing,Architect & Tech Lead Microsoft Azure,,Inconnue,,Inconnue,...,,,,,Inconnue,,,,,2022-11-25
3,M,Hicham,Ouali Alami,Consultant Externe,Vivier sourcing,Data Governance,,Inconnue,,Inconnue,...,,,,,Inconnue,,,,,2022-11-25
4,M,Hatem,Nasri,Consultant Externe,Vivier sourcing,Big Data eng,,Inconnue,,Inconnue,...,,,,,Inconnue,,,,,2022-11-25


## Show descriptive statistics using describe()
Use the ```pandas DataFrame.describe()```
[method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
to generate descriptive statistics. Descriptive statistics include those that
summarize the central tendency, dispersion and shape of a dataset’s
distribution, excluding ```NaN``` values. You may also use other Python methods
to interact with your data.

In [8]:
results.describe(include='all')



Unnamed: 0,Civilit__,Pr__nom,Nom,Type,Etape,Titre,Comp__tences,Evaluation_Globale,Date_de_Naissance,Nationalit__,...,CV2,CV3,CV4,CV5,Mobilit__,Langues,Domaines,Secteurs,Outils,Ingestion_date
count,968,968,893,968,968,935,617,968,66,968,...,189,23,0.0,109,962,151,566,33,617,968
unique,1,544,747,2,9,511,536,5,53,3,...,184,19,0.0,93,24,17,19,20,542,1
top,M,Mohamed,Ahmed,Consultant Externe,Vivier,Devops,"github, access, gitlab, architecture, sql, gra...",Inconnue,19/08/1990,Inconnue,...,https://sonatecloud-my.sharepoint.com/:w:/g/pe...,gs://spm-datalake/cvs/date_insertion=2022-11-2...,,gs://spm-datalake/cvs/date_insertion=2022-11-2...,Inconnue,", Anglais : Inconnue","RD, Data",engie,"scala : 5, spark : 5, big data : 5, operations...",2022-11-25
freq,968,34,4,935,344,25,2,610,2,958,...,2,2,,3,890,74,159,4,2,968
mean,,,,,,,,,,,...,,,,,,,,,,
std,,,,,,,,,,,...,,,,,,,,,,
min,,,,,,,,,,,...,,,,,,,,,,
25%,,,,,,,,,,,...,,,,,,,,,,
50%,,,,,,,,,,,...,,,,,,,,,,
75%,,,,,,,,,,,...,,,,,,,,,,
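
In the output above, the `mean`, `std`, and quantile rows are empty for the text columns because those statistics only apply to numeric data, while `count`, `unique`, `top`, and `freq` describe the text columns. A minimal sketch on a small made-up DataFrame (the column names and values are illustrative assumptions, not taken from `profiles_25`) shows the same behaviour:

```python
import pandas as pd

# Hypothetical toy data, not from the BigQuery table
toy = pd.DataFrame({
    "Type": ["Externe", "Externe", "Interne", None],
    "Score": [1.0, 2.0, 3.0, None],
})

summary = toy.describe(include="all")

# Text column: categorical statistics are filled, numeric ones are NaN
print(summary.loc[["count", "unique", "top", "freq", "mean"], "Type"])
# Numeric column: numeric statistics are filled, categorical ones are NaN
print(summary.loc[["count", "unique", "mean", "min", "max"], "Score"])
```

Missing values are excluded from `count` in both columns, which is why it can be lower than the number of rows.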


In [9]:
cols = results.columns

In [10]:
import pandas as pd
prf = pd.DataFrame()

In [11]:
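# Build one profile row per column, then concatenate once at the end
rows = []
for c in cols:
  nb_null = results[c].isna().sum()  # number of missing values in the column
  freq = results[c].value_counts()   # count of each distinct non-null value
  rows.append(pd.DataFrame(data = {'nom_col': c, 'nb_null': [nb_null], 'freq': [freq]}))
prf = pd.concat(rows)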


In [12]:
prf.head()

Unnamed: 0,nom_col,nb_null,freq
0,Civilit__,0,"M 968 Name: Civilit__, dtype: int64"
0,Pr__nom,0,Mohamed 34 Youssef 18 Ahmed 14 H...
0,Nom,75,Ahmed 4 Aloui 3 CHELLY ...
0,Type,0,Consultant Externe 935 Consultant Interne ...
0,Etape,0,Vivier 344 No Go Sonat...
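
Because each `freq` cell holds a whole `value_counts()` Series, the display above is truncated. One possible alternative, sketched here under the assumption that only the most frequent values matter, is to keep the top few counts as a plain dictionary. The helper name `top_values` and the sample data are hypothetical and not part of the original notebook:

```python
import pandas as pd

def top_values(series: pd.Series, n: int = 3) -> dict:
    """Return the n most frequent non-null values with their counts."""
    return series.value_counts().head(n).to_dict()

# Hypothetical toy column, not from the BigQuery table
etape = pd.Series(["Vivier", "Vivier", "No Go", "Vivier", "No Go", "Entretien", None])
print(top_values(etape, n=2))
```

In the profiling loop, this could replace `results[c].value_counts()` if a compact, readable summary is preferred over the full distribution.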


In [13]:
nb_records = len(results)
nb_records

968

In [14]:
prf['pct_null'] = 100 * prf['nb_null'] / nb_records  # percentage of missing values per column

In [15]:
prf

Unnamed: 0,nom_col,nb_null,freq,pct_null
0,Civilit__,0,"M 968 Name: Civilit__, dtype: int64",0.0
0,Pr__nom,0,Mohamed 34 Youssef 18 Ahmed 14 H...,0.0
0,Nom,75,Ahmed 4 Aloui 3 CHELLY ...,7.747934
0,Type,0,Consultant Externe 935 Consultant Interne ...,0.0
0,Etape,0,Vivier 344 No Go Sonat...,0.0
0,Titre,33,Devops 25 .Net ...,3.409091
0,Comp__tences,351,"github, access, gitlab, architecture, sql, gra...",36.260331
0,Evaluation_Globale,0,Inconnue 610 D 273 C 3...,0.0
0,Date_de_Naissance,902,19/08/1990 2 08/12/1986 2 ...,93.181818
0,Nationalit__,0,Inconnue 958 Non Français 7 França...,0.0
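
The null counts and percentages above can also be computed without an explicit loop. The following is a minimal sketch on made-up data (the values are illustrative assumptions, and the `nb_unique` column is an addition not present in the original profile); `isna().mean()` gives the fraction of missing values per column directly:

```python
import pandas as pd

# Hypothetical toy data, not from the BigQuery table
toy = pd.DataFrame({
    "Nom": ["Ahmed", None, "Aloui", None],
    "Type": ["Externe", "Externe", "Interne", "Externe"],
})

profile = pd.DataFrame({
    "nb_null": toy.isna().sum(),          # missing values per column
    "pct_null": toy.isna().mean() * 100,  # share of missing values, in percent
    "nb_unique": toy.nunique(),           # distinct non-null values per column
})
profile.index.name = "nom_col"
print(profile)
```

Applied to `results` instead of `toy`, this should give the same `nb_null` and `pct_null` figures as the loop, indexed by column name rather than by a repeated `0`.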
