# 📊 Customer Churn Analysis with BigQuery & Python 🚀

## **Introduction**

In this notebook, we will analyze customer churn data using Google BigQuery and Python. The dataset contains information about a bank's customers. The variables in the dataset are:

1. customer_id
2. credit_score 
3. country
4. gender
5. age
6. tenure
7. balance
8. products_number
9. credit_card
10. active_member
11. estimated_salary
12. churn, used as the target: 1 if the client left the bank during some period, 0 if they did not.

In this notebook we will set up the connection between Kaggle and BigQuery, load the data from BigQuery, and perform some basic data analysis. We will also visualize the data using matplotlib and seaborn.

## **Import Libraries**

For this first part we will use `os` to set the environment variables, `pandas` to manipulate the data and `google.cloud` to connect to BigQuery.

In [1]:
import os
import pandas as pd
from google.cloud import bigquery

## Setting up Service Account Credentials

For this project, I wanted to learn how to manage Kaggle datasets using BigQuery. It turns out GCP provides service accounts, which can be used to access BigQuery datasets.

First we have to set the environment variables for the service account credentials.

I decided to do this by downloading the key directly from GCP and saving it on the Codespaces virtual machine I'm using for development. To keep this process secure, I added the folder where the key is stored to the .gitignore file, so it won't be uploaded to GitHub.

Then, let's set an environment variable where the path to the .json file is saved.


In [2]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/workspaces/ABCBankChurnRate/.config/sa_credentials.json"
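A typo in the key path only shows up later, when the client is created. As a minimal sketch, the assignment can be wrapped in a small helper that fails early if the file is missing. The helper name `set_credentials` is my own and is not part of any Google library.

```python
import os
from pathlib import Path


def set_credentials(key_path: str) -> str:
    """Point GOOGLE_APPLICATION_CREDENTIALS at key_path.

    Hypothetical helper: raises FileNotFoundError if the key file is missing,
    instead of letting the error surface later when the client is created.
    """
    path = Path(key_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Service account key not found: {path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return str(path)
```

It would be called with the same path used above, for example `set_credentials("/workspaces/ABCBankChurnRate/.config/sa_credentials.json")`.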

## Initializing BigQuery Client

Now we create a variable called `client`, an instance of the `google.cloud.bigquery.Client` class, which has its own methods and attributes. One of its attributes is `.project`. Printing it confirms that the client loaded the service account credentials and points to the GCP project I set up for this dataset.

In [3]:
client = bigquery.Client()
print(client.project)

kagglebigquerybankchurn


This `client` object will allow us to perform operations like:

- Running SQL queries on BigQuery datasets (my main goal in pursuing this specific path)
- Creating datasets and tables (needed to store the data we fetch from Kaggle)
- Fetching query results as `pandas` DataFrames.
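As a rough sketch of the first and last items, a query can be run and turned into a DataFrame in two calls. The wrapper name `query_to_dataframe` is my own. It assumes `client` is a `bigquery.Client` and that the optional pandas dependencies of `google-cloud-bigquery` are installed.

```python
def query_to_dataframe(client, sql):
    """Run a SQL query on BigQuery and return the result as a pandas DataFrame.

    `client` is expected to be a google.cloud.bigquery.Client instance.
    """
    query_job = client.query(sql)      # submits the query and returns a QueryJob
    return query_job.to_dataframe()    # waits for the job and downloads the rows
```

For example, `query_to_dataframe(client, "SELECT 1 AS x")` should return a one-row DataFrame once the client is authenticated.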

## Fetching Dataset From Kaggle


Now, this dataset is available on Kaggle [here](https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset). I selected this dataset because the dependent variable is categorical. From my background in analytical chemistry I'm used to working with continuous dependent variables, so this is a great opportunity for me to learn how these different systems behave.

Let's first import the `kaggle API`:

In [4]:
from kaggle.api.kaggle_api_extended import KaggleApi
import glob

`glob` will be used to handle the datasets temporarily. I think datasets should not be saved in repositories for efficiency, so we will see below how to address this.
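To illustrate the idea of handling files temporarily, here is a small sketch that combines `glob` with the standard `tempfile` module. The helper name `find_csv_files` and the placeholder file are assumptions for the example. In the real workflow the folder would hold the files downloaded from Kaggle.

```python
import glob
import os
import tempfile


def find_csv_files(folder: str) -> list[str]:
    """Return the sorted paths of all CSV files directly inside `folder`."""
    return sorted(glob.glob(os.path.join(folder, "*.csv")))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the real download: write a tiny placeholder CSV.
    with open(os.path.join(tmp_dir, "example.csv"), "w") as f:
        f.write("customer_id,churn\n1,0\n")

    csv_files = find_csv_files(tmp_dir)
    print([os.path.basename(p) for p in csv_files])  # ['example.csv']
# The temporary folder and its contents are deleted when the block exits.
```

Because the folder disappears when the `with` block ends, nothing from the dataset ends up committed to the repository.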

Now, we will temporarily download the dataset using my 

## Uploading Dataset to BigQuery

In the Google Data Analytics Professional Certificate, I had the chance to work with BigQuery and I found it amazing. From that time I remember that each table lives within a dataset and should have its own unique identifier. We will need that now, so let's set it up.

In the following code we define the `dataset_id` and `table_id` that our Kaggle dataset will have inside BigQuery and that we will use to run our queries.

In [5]:
project_id = client.project # This is already defined by the service account
dataset_id = f"{project_id}.churn_analysis"
table_id = "Kaggle_churn"
full_table_id = f"{dataset_id}.{table_id}"
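The fully qualified identifier follows the pattern `project.dataset.table`. As a small sketch, the same string can be built by a helper that rejects empty parts. The function name `build_table_id` is hypothetical, and the check assumes none of the three parts contains a dot.

```python
def build_table_id(project_id: str, dataset_name: str, table_name: str) -> str:
    """Join the three parts into a 'project.dataset.table' identifier.

    Assumption for this sketch: no part is empty or contains a dot.
    """
    parts = (project_id, dataset_name, table_name)
    if any(not part or "." in part for part in parts):
        raise ValueError(f"Invalid table identifier parts: {parts}")
    return ".".join(parts)


print(build_table_id("kagglebigquerybankchurn", "churn_analysis", "Kaggle_churn"))
# kagglebigquerybankchurn.churn_analysis.Kaggle_churn
```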

Now that we have defined the unique identifier for the table and the dataset in which it will be saved, we can go ahead and create the dataset:

In [6]:
client.create_dataset(dataset_id, exists_ok=True)

Dataset(DatasetReference('kagglebigquerybankchurn', 'churn_analysis'))

I've been through this process many times on the BigQuery site, and every time I felt that selecting options from drop-down menus and pushing buttons was not as reproducible as I would like. Now I'm glad it can be written down in code and anyone can use this for their needs (assuming someone besides me reads this 🤣).


Now let's check whether there are any tables in the dataset. There should be none, because we have not used the Kaggle API yet:

In [7]:
tables = {table.table_id for table in client.list_tables(dataset_id)}
print(tables)

set()


OK, this set comprehension (which I find tremendously efficient) loops through all the tables in the dataset identified by `dataset_id`. Indeed, since we haven't used the Kaggle functions yet, there is no table. Let's get to that.
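The same check can be wrapped in a small helper, which could be handy later to decide whether the table still needs to be uploaded. This is only a sketch: the name `table_exists` is my own, and it relies on `client.list_tables` in the same way as the cell above.

```python
def table_exists(client, dataset_id: str, table_id: str) -> bool:
    """Return True if a table named `table_id` is listed in the dataset.

    `client` is expected to be a google.cloud.bigquery.Client instance.
    """
    return any(
        table.table_id == table_id
        for table in client.list_tables(dataset_id)
    )
```

At this point in the notebook, `table_exists(client, dataset_id, table_id)` should return `False`, matching the empty set printed above.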