<a href="https://colab.research.google.com/github/annaskaroliny/latamAirlines/blob/main/%5BCONFIDENTIAL%5D_TW_LATAM_Allocation_Process_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **LATAM Preparation - Complete Guide**

**Welcome to the Data Engineer Guide.** On this occasion, you will have the opportunity to get closer to the reality of the role, demonstrate your skills and knowledge in data processing with Python and different data structures.

This guide has all the tips and recommendations to successfully crack all the steps required for the LATAM allocation process. It reflects the preparation and strategy I followed to take on the LATAM challenges. Enjoy!!

**This document is for TWERS use only and it contains confidential information.**



# **Hackerrank Test preparation**

My approach to crack the Hackerrank challenge was based on the Topics that are covered, using Gemini and the tips and recommendations I could get from other TWERS.

* The HackerRank test consists of 23 questions in total: 20 single- or multiple-choice questions and 3 Python coding questions.
* The test topics are described in the last section of the Gemini input.
* This is the prompt I wrote for Gemini (if the answer is too long, you will need to ask Gemini to continue the explanation). Please don't just run it; read it as well, because it contains everything I could gather in order to prepare:


```
I have to prepare for a hackerrank test and that there are questions that can be general theory and not only related to GCP.
It will be cool to have both views.

I know that they ask for monolith, legacy, soa, design patterns, google api, software engineering patterns, architecture.
As far as I know the questions can be a definition or well giving you context to see what is best in that question.
Yet, that is what I know and there can be another questions as well.

I need as many business case questions and answers as possible (related to airlines and other domains) as well as concept questions and answers, all based on hackerrank preferably, within the list of topics that I will give to you.
I want all the questions and answers you can find.
I have time to read everything so it doesn't matter if the answer is large.

There are 20 questions of unique and multiple selection and I have 1.5 hours to solve them, from the following list with the format "Topic: What we waiting for?" -- please give and explain everything you have, for example if you suggest something, explain what it is or how to do it, in other words, don't make me going to another webpages nor asking you things back, don't suggest doing things, do them yourself. If something requires an example you give it.
If code is needed use python and if you need sql use sql.

Consider that I don't know anything related to this so if you use a professional term, you have to explain it and give examples if necessary:
- GIT (1 question): Good versioning practices, ability to work as a team with distributed responsibilities, maintaining orderly and controlled code
- Data structure & algorithms (4 questions): How data is organized and stored efficiently & Step-by-step methods for solving problems efficiently.
- CI/CD (2 questions): Experience in deployment and integration platforms within the development cycle
- IO (1 question): Manipulation of process input and output data optimally
- Cloud (3 questions): Experience in cloud development and resource integration
- Design and architecture patterns (5 questions): Ability to recognize/implement patterns to solve common problems in software development
- Programming paradigms and development strategies (4 questions): Principles or styles for creating software and some common strategies for development
```

* TIP: For python development questions (3) you can exercise in Hackerrank: arrays, HTML parsing (investigate BeautifulSoup) and str.
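
To make the HTML parsing tip concrete, here is a minimal practice sketch. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without installing anything. The `LinkCollector` class, the `extract_links` helper and the sample markup are illustrative assumptions and are not taken from the actual test.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical practice helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return the href values of all anchors, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


sample = '<p>See <a href="https://example.com">this</a> and <a href="/docs">docs</a>.</p>'
print(extract_links(sample))
```

The same exercise is worth repeating with BeautifulSoup once you are comfortable with the event-based parser.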

# **Data Ingest, Storage and Processing with Google Drive, Google Cloud Storage and Google BigQuery in Google Colab (Jupyter) with Python 3.10**

**Preparation:**

1. Initial project commit (done with GitHub desktop)
2. Install Git Flow with `brew install git-flow`
3. Configure the repository with `git flow init`
4. Keep remote feature branches after a feature is finished with `git config gitflow.feature.finish.keepremote true`
5. Configure releases so they do not merge directly into main (the command I used is listed under Configurations in the Notes section)
6. Synchronize the repository with GDrive
7. Read the code from GDrive with Colab

**Additional Notes:**

* The `README.md` file mentions using GitHub Desktop, Git Flow, and Colab. These are tools that can be used for version control and code collaboration.
* The `README.md` file also mentions measuring time and memory. This can be done with Python's built-in `time` module and the third-party `memory_profiler` package.
* English was used for both documentation and code.
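
As a rough illustration of the time and memory measurement mentioned above, the sketch below wraps a function call and reports elapsed time and peak memory. It deliberately uses the standard library's `tracemalloc` instead of `memory_profiler` so that it has no extra dependency; the `measure` helper is a hypothetical name, not something from the challenge repository.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float, int]:
    """Run func and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


result, elapsed, peak = measure(sum, range(1_000_000))
print(f"result={result} elapsed={elapsed:.4f}s peak={peak} bytes")
```

`tracemalloc` only tracks memory allocated by Python itself, so the numbers will differ from what `memory_profiler` reports for the whole process.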

## **Challenge Guidelines:**

**Repository:**

* Your solution must be in a public repository on the GitHub platform.

**Submitting your challenge:**

1. Make a POST request to [https://advana-challenge-check-api-cr-k4hdbggvoq-uc.a.run.app/data-engineer](https://advana-challenge-check-api-cr-k4hdbggvoq-uc.a.run.app/data-engineer).
2. The request body should be a JSON object with the following fields:
    * `name`: Your full name
    * `mail`: Your email address
    * `github_url`: The URL of your GitHub repository containing the solution
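
The submission request described above could be prepared along these lines. This is a sketch using only the standard library; the `build_submission_request` helper and the sample name, email and repository URL are placeholders, and the actual send is left commented out so that running the snippet has no side effects.

```python
import json
import urllib.request

CHALLENGE_URL = "https://advana-challenge-check-api-cr-k4hdbggvoq-uc.a.run.app/data-engineer"


def build_submission_request(name: str, mail: str, github_url: str) -> urllib.request.Request:
    """Build (but do not send) the JSON POST request for the submission."""
    payload = {"name": name, "mail": mail, "github_url": github_url}
    return urllib.request.Request(
        CHALLENGE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values, replace them with your own before sending.
request = build_submission_request(
    "Jane Doe", "jane@example.com", "https://github.com/janedoe/latam-challenge"
)
print(request.get_method(), request.full_url)

# To actually submit, uncomment:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.status, response.read().decode("utf-8"))
```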

**Deadline:**

* The deadline for submitting the challenge is 5 calendar days after receiving the challenge.

**Technology and Techniques:**

* You can use any technology or technique you prefer for data processing.
* We will value your knowledge of cloud platforms.
* If you use cloud platforms, follow the steps in your files WITHOUT adding access credentials to the different services.

**Ranking Criteria:**

* Challenges that are clearly organized, explanatory, modular, efficient, and creative will be ranked higher.

**Assumptions and Documentation:**

* Write down the assumptions you are making.
* Include the versions of the libraries you are using in the requirements.txt file.
* Do not delete what is already written in the requirements.txt file.
* For this challenge, we recommend that you clearly describe how each part of your exercise can be improved.

**Data:**

* You must use the data contained in the provided file.
* You can use the official Twitter documentation to understand the data structure.

**Git Usage:**

* We will positively evaluate good practices of Git usage.
* Use the main branch for any final version you want us to review.
* We recommend that you use some GitFlow practice.
* Do not delete your development branches.

**Error Handling and Edge Cases:**

* Consider error handling and edge cases.

**Maintainability, Readability, and Scalability:**

* Remember that you will be working with other developers, so the maintainability, readability, and scalability of your code is essential.

**Code Documentation:**

* Good code documentation always helps the reader.

**Additional Notes:**

* `# type: ignore` was used for those libraries that exist by default in the Google Colab environment and are therefore not required in the `requirements.txt` file.

# **Beginning of Challenge Code**
My approach was to firstly have a snippet with all the imports required for the entire notebook. Remember that if you want to load python files in your project you can use the magic command `%run file.py`.

Remember to always comment or document your assumptions so others can understand your reasoning!!

Here is an example of initial imports:

```
# Import the 'sys' library for interacting with the interpreter
import sys

# Exit the script if it's not the main program (avoids unintended execution as a module)
if __name__ != "__main__":
    sys.exit()

# Import general-purpose libraries:
import logging  # For logging messages
import os  # For interacting with the operating system
import time  # For working with time-related functions

# Import type-related libraries:
from typing import List, Tuple  # For type annotations
import datetime  # For working with dates and times

# Import I/O-related libraries:
import io  # For common I/O operations

# Import Google Colab-specific libraries:
from google.colab import drive  # For interacting with Google Drive

# Print a message indicating successful library imports
print("All initial libraries were imported")

```



## **Definitions and Configurations:**
This code snippet defines constants used in the data transfer and processing pipeline:

- **Google Cloud Storage (GCS) Information:**
    - `BUCKET_NAME`: Specifies the name of the GCS bucket where data will be uploaded (`tw-gcp-public-lab`).
    - `FOLDER_NAME`: Denotes the folder within the bucket to store the uploaded file (`raw`).
    - `ZIP_FILE_NAME`: Represents the name of the compressed file containing tweets data (`tweets.json.zip`).
    - `GCS_SOURCE_URI`: Constructs the full URI for the file location in GCS after upload (`gs://tw-gcp-public-lab/raw/`).

- **Local File Paths:**
    - `SOURCE_PATH`: Currently defines a local file path (`/content/drive/Othercomputers/My Mac/latam-challenge`), but it's not used in the provided code for downloading.

- **Google Cloud Project and Dataset Information:**
    - `PROJECT_ID`: Specifies the Google Cloud project ID (`tw-techdash`).
    - `DATASET_NAME`: Defines the name of the BigQuery dataset where the data will be loaded (`tweets_dataset`).
    - `TABLE_NAME`: Identifies the name of the BigQuery table to store the extracted tweets data (`tweets`).

**Observations:**

- The `SOURCE_PATH` might require modification if you intend to download a file from a different location.
- Consider using environment variables or a configuration file to manage these constants, making your code more flexible and easier to maintain.
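
One possible way to follow the environment-variable suggestion is sketched below. The default values are the ones listed above; the `load_config` helper and the idea of passing the environment in as a mapping (which keeps the function easy to test) are assumptions for illustration, not the notebook's actual code.

```python
import os
from typing import Dict, Mapping

# Defaults taken from the constants described above.
DEFAULTS: Dict[str, str] = {
    "BUCKET_NAME": "tw-gcp-public-lab",
    "FOLDER_NAME": "raw",
    "ZIP_FILE_NAME": "tweets.json.zip",
    "PROJECT_ID": "tw-techdash",
    "DATASET_NAME": "tweets_dataset",
    "TABLE_NAME": "tweets",
}


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read each constant from env, falling back to the default when unset."""
    config = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # The GCS URI is derived, so it always stays consistent with bucket and folder.
    config["GCS_SOURCE_URI"] = f"gs://{config['BUCKET_NAME']}/{config['FOLDER_NAME']}/"
    return config


config = load_config()
print(config["GCS_SOURCE_URI"])
```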

## **Jupyter Kernel code reloading**

**Functionality:**

* This code snippet utilizes magic commands within Jupyter Notebooks to manage code reloading.
* The `%reload_ext autoreload` line loads the `autoreload` extension, reloading it if it was already loaded.
* The `%autoreload 2` line configures the `autoreload` extension to automatically reload Python modules when changes are detected.

**Key Concepts:**

* **Jupyter Magic Commands:** `%` prefix is used for magic commands that provide special functionality within Jupyter notebooks.
* **Autoreload Extension:**  A Jupyter extension that automatically reloads Python modules when changes are detected in the corresponding source files.
* **Reload Mode:** Mode `2` reloads all imported modules (except those excluded with `%aimport`) every time before the code you run is executed. Mode `1` only reloads modules explicitly registered with `%aimport`, and mode `0` disables automatic reloading.

**Overall Assessment:**

* This code improves development efficiency within Jupyter notebooks by automatically reloading code, avoiding manual restarts.
* It leverages the `autoreload` extension for automatic reloading functionality.
* The configuration level `2` ensures comprehensive reloading behavior.

**Potential Enhancements:**

* While automatic reloading is helpful in development, it might not be suitable for production environments due to potential unexpected behavior during execution.
* Consider using this approach primarily for interactive development within Jupyter notebooks.

## **Google Drive mounting**

**Functionality:**

1. **Connects Google Drive:** This code establishes a connection between your Google Drive storage and the virtual machine running the Colab notebook.
2. **Navigates to Project Directory:** This magic command changes the working directory within the Colab notebook to a specific location within your project directory.

**Key Concepts:**

* **Google Drive Mounting:**
    - `from google.colab import drive`: Imports the `drive` module for interacting with Google Drive from Colab.
    - `drive.mount('/content/drive', force_remount=True)`: Mounts your Drive at the `/content/drive` path within Colab.
    - **Authorization:** Requires initial authorization to grant Colab access to your Drive.
* **Jupyter Notebook Magic Commands:**
    - `%cd`: A magic command specifically designed for changing directories.

**Overall Assessment:**

* **Convenient Data Access:** Enables seamless access to your personal data stored in Google Drive for use within Colab notebooks.
* **Improved Code Organization:** Helps organize your notebook within the project structure by focusing on a specific subdirectory (like "src").

**Potential Enhancements:**

* **Google Drive Mounting:**
    - **Error Handling:** Consider incorporating `try-except` blocks to gracefully handle potential mounting issues.
    - **Authentication Persistence:** Explore ways to persist the authentication token (if applicable) to avoid re-authorization for every session.
* **Navigation:**
    - **Clear Path Definitions:** Replace `{SOURCE_PATH}` with the actual path to your project directory for clarity.
    - **Error Handling:** Consider handling potential issues like non-existent directories using Python code (like `try-except` blocks).

**Explanation:**

1. **Mount Google Drive:** The first part of the code imports the `drive` module and mounts your Google Drive to the `/content/drive` directory within Colab. This allows you to access your Drive files from within your notebook.
2. **Change Directory:** The `%cd {SOURCE_PATH}/src` line uses a magic command to navigate to the subdirectory named "src" within your project directory (assuming `{SOURCE_PATH}` points to the correct location). This helps organize your notebook by focusing on the relevant project code.

**Important Notes:**

* Replace `{SOURCE_PATH}` with the actual path to your project directory on your machine.
* You'll need to go through an authorization process the first time you run the mounting code to grant Colab access to your Drive.
* It was not possible to have this
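
The error handling suggested above for mounting and navigation might look like the following sketch. `drive.mount` with `force_remount=True` is the call described in this section; the `mount_drive_and_enter` wrapper is a hypothetical helper, and `os.chdir` is used as the plain-Python counterpart of the `%cd` magic. Outside Colab the function simply returns `False`.

```python
import logging
import os


def mount_drive_and_enter(project_dir: str, mount_point: str = "/content/drive") -> bool:
    """Mount Drive and change into project_dir; return False instead of raising."""
    try:
        from google.colab import drive  # type: ignore  # only available inside Colab
    except ImportError:
        logging.warning("google.colab is not available; skipping Drive mount")
        return False
    try:
        drive.mount(mount_point, force_remount=True)
        os.chdir(project_dir)  # fails with OSError if the directory does not exist
    except Exception as error:  # broad on purpose: mount failures vary
        logging.error("Could not mount Drive or enter %s: %s", project_dir, error)
        return False
    return True
```

In the notebook this would be called with the `src` folder under `SOURCE_PATH`, for example `mount_drive_and_enter(f"{SOURCE_PATH}/src")`.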

## **Libraries requirements and imports**

**Functionality:**

- **Installs Python Libraries:** This code snippet installs a collection of Python libraries listed in a file named `requirements.txt` within the currently active virtual environment.

**Key Concepts:**

- **requirements.txt File:** This text file contains a list of library names and their version requirements, ensuring consistent installation across environments.
- **Virtual Environments:** Virtual environments isolate project dependencies, preventing conflicts with other Python projects on your system.
- **sys.executable:** This Python variable holds the path of the Python interpreter running the current process (here, the notebook kernel), which is the virtual environment's interpreter when one is active.
- **pip:** The Python Package Installer (pip) is used for managing Python packages and libraries.

**Explanation:**

1. **`import sys`**: Imports the `sys` module, providing access to system-specific variables and functions.
2. **`!{sys.executable} -m pip install -r '../requirements.txt'`**: This line calls pip through the kernel's own interpreter:
   - **`!`**: IPython shell escape that runs the rest of the line as a terminal command.
   - **`{sys.executable}`**: Expanded by IPython to the interpreter path, so pip installs into the same environment the notebook is running in.
   - **`-m`**: Designates a module to execute as a script (in this case, `pip`).
   - **`install -r`**: Instructs pip to install packages from a requirements file.
   - **`'../requirements.txt'`**: Specifies the path to the requirements file (relative to the current working directory).

**Important Notes:**

- **Virtual Environment Activation:** Ensure you've activated the desired virtual environment before running this code.
- **Path to requirements.txt:** Verify that `../requirements.txt` correctly points to the file's location.
- **Internet Connection:** An internet connection is required for pip to download and install packages.

**Overall Assessment:**

- **Efficient Dependency Management:** Using `requirements.txt` is a best practice for managing project dependencies consistently.
- **Consistent Environments:** Facilitates consistent library installations across different machines for reproducibility.
- **Collaboration:** Enables easy setup of the same project environment for others.

**Potential Enhancements:**

- **Error Handling:** Consider incorporating error handling (like try-except blocks) to gracefully handle potential issues during installation, such as network connectivity problems or missing packages.
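
A sketch of the error handling suggested above, written as plain Python instead of the `!` shell escape so that failures can be caught. The helper names `pip_install_command` and `install_requirements` are hypothetical; the pip invocation is the same `-m pip install -r` call explained earlier.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def pip_install_command(requirements_path: str) -> List[str]:
    """Build the pip command using the interpreter that runs this process."""
    return [sys.executable, "-m", "pip", "install", "-r", requirements_path]


def install_requirements(requirements_path: str = "../requirements.txt") -> bool:
    """Install the requirements file; return False on a missing file or pip failure."""
    if not Path(requirements_path).is_file():
        print(f"Requirements file not found: {requirements_path}")
        return False
    try:
        subprocess.run(pip_install_command(requirements_path), check=True)
    except subprocess.CalledProcessError as error:
        # pip returns a non-zero exit code on network or resolution problems
        print(f"pip exited with code {error.returncode}")
        return False
    return True
```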


## **Ingest Google Drive ZIP into Google Storage**
**Key Functions:**

1. **authenticate_google_drive()**: Authenticates with Google Drive using the user's credentials.
2. **download_file_from_drive(drive_service, file_id)**: Downloads a specified file from Google Drive.
3. **upload_file_to_cloud_storage(client, bucket_name, folder_name, downloaded, zip_file_name)**: Uploads a file to Google Cloud Storage, handling folder creation if needed.
4. **decompress_zip_file(client, bucket_name, folder_name, zip_file_name)**: Decompresses a ZIP file within a GCS bucket.

**Code Structure:**

- **Logging:** Employs `logging` for debugging and tracking progress.
- **Error Handling:** Uses try-except blocks to gracefully handle potential errors.
- **Modularity:** Separates functionality into distinct, reusable functions.
- **Type Hints:** Enhances code readability and potential type checking.

**Main Code Execution:**

1. Configures logging to a file named 'transfer.log'.
2. Authenticates with Google Drive.
3. Downloads the specified file from Drive.
4. Creates a Cloud Storage client.
5. Uploads the downloaded file to GCS.
6. Decompresses the ZIP file in GCS if its content type is 'application/zip'.
7. Logs success or failure messages.
8. Finally, ensures the downloaded file is closed.

**Overall Assessment:**

- **Well-structured:** The code is organized, modular, and includes error handling.
- **Clear Functionality:** It effectively handles file transfer and decompression tasks.
- **Authentication Flexibility:** Uses authentication methods external to the code (useful for avoiding credentials in code).
- **Good Practices:** Adheres to good practices like logging and try-except blocks.

**Potential Enhancements:**

- **Parameterization:** Explore using command-line arguments or configuration files to adjust parameters more flexibly.
- **Progress Reporting:** Consider more granular progress reporting for downloads/uploads.
- **Content Validation:** Validate file content after decompression for integrity.
- **Advanced Error Handling:** Implement retries or alternative actions for potential errors.

This code provides a foundation for file transfer and decompression tasks within Google Cloud environments, demonstrating clarity and attention to best practices.
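
The decompression step can be illustrated without any cloud dependency. The sketch below unpacks a ZIP archive held in memory; in the pipeline described above, the bytes would presumably come from a Cloud Storage blob and each extracted member would be uploaded back to the bucket, but those calls are left out here. The `extract_zip_members` function and the tiny sample archive are assumptions for illustration.

```python
import io
import zipfile
from typing import Dict


def extract_zip_members(zip_bytes: bytes) -> Dict[str, bytes]:
    """Return {member_name: content} for every file inside the archive."""
    members: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # folder entries carry no data of their own
            members[info.filename] = archive.read(info)
    return members


# Build a tiny archive in memory to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("tweets.json", '{"id": 1}\n{"id": 2}\n')

extracted = extract_zip_members(buffer.getvalue())
print(sorted(extracted))
```

Reading every member fully into memory is fine for a sketch, but a large archive would call for streaming each member instead.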

## **BigQuery Storage Functions**

**Functionality:**

These Python functions interact with BigQuery to authenticate, create datasets and tables, and load data from Cloud Storage.

**Key Concepts:**

* **Client:** The `bigquery.Client` object is central to interacting with BigQuery.
* **Datasets and Tables:** Datasets organize tables, and both can be created or overwritten using these functions.
* **Data Loading:** Data is loaded from Cloud Storage in newline-delimited JSON format, and BigQuery automatically infers the schema.
* **Error Handling:** The functions use logging and try-except blocks to handle errors and provide informative messages.

**Potential Enhancements:**

* **Progress Reporting:** Consider logging loading progress.
* **Data Validation:** Explore adding data validation checks before loading.
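
A load from Cloud Storage with schema autodetection, as described above, could be sketched as follows. It assumes the `google-cloud-bigquery` package is installed and that credentials are already configured in the environment. The function names and the choice of `WRITE_TRUNCATE` (overwrite the table) are assumptions, not necessarily what the original notebook does.

```python
def build_table_id(project_id: str, dataset_name: str, table_name: str) -> str:
    """Compose the fully qualified table id expected by the BigQuery client."""
    return f"{project_id}.{dataset_name}.{table_name}"


def load_json_from_gcs(source_uri: str, table_id: str) -> int:
    """Load newline-delimited JSON from GCS into table_id; return the row count."""
    from google.cloud import bigquery  # imported lazily: needs google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
    load_job.result()  # block until the job finishes (raises on failure)
    return client.get_table(table_id).num_rows


print(build_table_id("tw-techdash", "tweets_dataset", "tweets"))
```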


## **BigQuery Processing Functions**

**Functionality:**

- **Processes BigQuery Results:** The `process_bigquery_results` function executes the query, handles results, and converts them into a desired format (list of tuples with date and username).

**Key Concepts:**

- **Type Hints:** Employs type hints (`List`, `Tuple`, `datetime.date`) for improved code readability and potential static type checking.
- **Error Handling:** Incorporates `try-except` blocks to gracefully handle exceptions (`BadRequest` and generic exceptions).
- **Data Conversion:** Converts retrieved data rows into the specified format.

**Overall Assessment:**

- **Clear Separation:** Functions promote modularity and reusability.
- **Meaningful Variable Names:** Descriptive names enhance code understandability.
- **Error Management:** Handles potential errors during query execution and processing.

**Potential Enhancements:**

- **Input Validation:** Consider validating the constructed query string before execution.
- **Logging:** Integrate logging for detailed tracking and debugging.
- **Security:** Ensure secure credential management for BigQuery access.
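- **Query Parameterization:** BigQuery query parameters can only stand in for values, not for identifiers such as dataset or table names. If `DATASET_NAME` and `TABLE_NAME` need to vary, validate them before building the query string, and reserve query parameters for filter values.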
- **Data Usage:** Currently, the extracted data is printed. You can modify this section to store the data in a desired location or perform further processing.

This code provides a foundation for working with BigQuery data retrieval and processing. You can extend it based on your specific needs.
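
The data conversion step can be sketched independently of BigQuery. The function below accepts any iterable of rows that can be indexed by column name and returns the list of `(date, username)` tuples described above. The function name is hypothetical, and the default column names `tweets_date` and `username` are taken from the query output described later in this guide.

```python
import datetime
from typing import Any, Iterable, List, Mapping, Tuple


def rows_to_date_user_tuples(
    rows: Iterable[Mapping[str, Any]],
    date_field: str = "tweets_date",
    user_field: str = "username",
) -> List[Tuple[datetime.date, str]]:
    """Convert query rows into (date, username) tuples."""
    results: List[Tuple[datetime.date, str]] = []
    for row in rows:
        value = row[date_field]
        # A timestamp column may arrive as a datetime; normalise it to a date.
        if isinstance(value, datetime.datetime):
            value = value.date()
        results.append((value, str(row[user_field])))
    return results


sample_rows = [
    {"tweets_date": datetime.date(2021, 2, 12), "username": "RanbirS00614606"},
    {"tweets_date": datetime.datetime(2021, 2, 13, 8, 30), "username": "MaanDee08215437"},
]
print(rows_to_date_user_tuples(sample_rows))
```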

## **BigQuery Queries**

**Snippet 1: Top 10 Dates with Top Users**



```
[(datetime.date(2021, 2, 12), 'RanbirS00614606'),
 (datetime.date(2021, 2, 13), 'MaanDee08215437'),
 (datetime.date(2021, 2, 17), 'RaaJVinderkaur'),
 (datetime.date(2021, 2, 16), 'jot__b'),
 (datetime.date(2021, 2, 14), 'rebelpacifist'),
 (datetime.date(2021, 2, 18), 'neetuanjle_nitu'),
 (datetime.date(2021, 2, 15), 'jot__b'),
 (datetime.date(2021, 2, 20), 'MangalJ23056160'),
 (datetime.date(2021, 2, 23), 'Surrypuria'),
 (datetime.date(2021, 2, 19), 'Preetm91')]
```



**Functionality**

This SQL query identifies the top 10 dates with the most tweets and, for each of those dates, finds the user with the most tweets (considering usernames alphabetically in case of ties).

**Key Concepts**

* **Common Table Expressions (CTEs):** The query utilizes two CTEs:
    * `TopDates`: Calculates the daily tweet count and ranks them in descending order, selecting the top 10.
    * `TopUsersDate`: Joins the `tweets` table with `TopDates` to find the user(s) with the most tweets for each top date. It uses `ROW_NUMBER()` to handle ties by username order.
* **Window Functions:** `ROW_NUMBER()` is used within `TopUsersDate` to assign a unique row number within each date partition, ordered by the user's tweet count (descending) and then by username (alphabetically) to break ties.
* **Filtering:** The final result retrieves users with `row_number = 1` (the user with the most tweets for each date).

**Overall Assessment**

This query effectively addresses the task by leveraging CTEs for modularity and window functions to handle ranking and ties.

**Potential Enhancements**

* **Clarity:** Consider adding comments within the query to explain the purpose of each CTE.
* **Efficiency:** Explore alternative approaches to handle ties if performance is critical.

**Data Usage**

The query currently prints the `tweets_date` and `username`. You might want to consider storing this information in a table or using it for further analysis.
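
To make the ranking and tie-breaking logic concrete, here is the same idea in plain Python instead of SQL. This is not the query itself: the function name and the small sample are illustrative assumptions, and the tie-break (alphabetical username) follows the description above.

```python
import datetime
from collections import Counter
from typing import Iterable, List, Tuple


def top_dates_with_top_user(
    tweets: Iterable[Tuple[datetime.date, str]], limit: int = 10
) -> List[Tuple[datetime.date, str]]:
    """For the busiest dates, return the user with the most tweets on each."""
    records = list(tweets)
    per_date = Counter(day for day, _ in records)
    per_date_user = Counter(records)
    # Busiest dates first, like the TopDates CTE.
    top_dates = [day for day, _ in per_date.most_common(limit)]
    result = []
    for day in top_dates:
        candidates = [(user, count) for (d, user), count in per_date_user.items() if d == day]
        # Same effect as ROW_NUMBER() ordered by count DESC, username ASC, keeping row 1.
        best_user, _ = min(candidates, key=lambda item: (-item[1], item[0]))
        result.append((day, best_user))
    return result


d1, d2 = datetime.date(2021, 2, 12), datetime.date(2021, 2, 13)
sample = [(d1, "bob"), (d1, "alice"), (d1, "bob"), (d1, "alice"), (d2, "carol")]
print(top_dates_with_top_user(sample))
```

In the sample, `alice` and `bob` are tied on the first date, so the alphabetical rule picks `alice`.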

**Snippet 2: Top 10 Most Used Emojis**



```
[('🙏', 7286),
('😂', 3072),
('🚜', 2972),
('✊', 2411),
('🌾', 2363),
('🇮', 2096),
('🇳', 2094),
('❤', 1779),
('👇', 1108),
('💚', 1040)]
```



**Functionality**

This query extracts emojis from tweets and identifies the top 10 most frequently used emojis along with their counts.

**Key Concepts**

* **Regular Expressions (RegEx):** The `REGEXP_REPLACE()` function utilizes a complex RegEx pattern to capture a wide range of emoji characters across different Unicode blocks.
* **UNNEST:** The `UNNEST()` operator is used to explode the extracted emoji list into a single row per emoji for counting.

**Overall Assessment**

This query effectively extracts and counts emojis, providing valuable insights into emoji usage.

**Potential Enhancements**

* **Filtering:** Depending on the analysis goals, you might want to filter out specific emoji categories (e.g., flags, country codes).
* **Normalization:** Consider normalizing emojis to a canonical form to handle variations (e.g., skin tone modifiers).

**Data Usage**

The query currently prints the `emoji` and `count`. You could store this information for further analysis of emoji popularity.
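
The extract-then-count approach can also be tried locally in Python. The pattern below is deliberately narrow and is an assumption for illustration, not the pattern used in the query. It matches one code point at a time, which is consistent with flag halves such as 🇮 and 🇳 appearing as separate entries in the output above.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Narrow illustrative pattern covering a few common emoji blocks.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag halves)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def top_emojis(texts: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Count every emoji occurrence across texts and return the most common."""
    counts: Counter = Counter()
    for text in texts:
        # findall plays the role of extracting the list that UNNEST then flattens
        counts.update(EMOJI_PATTERN.findall(text))
    return counts.most_common(limit)


print(top_emojis(["🙏 support 🙏", "farmers 🚜"]))
```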

**Snippet 3: Top 10 Influential Users**



```
[('narendramodi', 2265),
 ('Kisanektamorcha', 1840),
 ('RakeshTikaitBKU', 1644),
 ('PMOIndia', 1427),
 ('RahulGandhi', 1146),
 ('GretaThunberg', 1048),
 ('RaviSinghKA', 1019),
 ('rihanna', 986),
 ('UNHumanRights', 962),
 ('meenaharris', 926)]
```



**Functionality**

This query identifies the top 10 users with the most mentions (`@username`) received in tweets.

**Key Concepts**

* **UNNEST:** Similar to snippet 2, `UNNEST()` is used to explode the mentioned user list from each tweet for counting mentions.

**Overall Assessment**

This query effectively identifies influential users based on mentions.

**Potential Enhancements**

* **Filtering:** You might consider filtering out self-mentions or mentions from specific accounts.
* **Weighted Mentions:** Depending on the analysis goals, explore assigning weights to mentions based on factors like follower count.

**Data Usage**

The query currently prints the `username` and `mention_count`. You could store this information for further analysis of user influence.
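
The flatten-and-count step that `UNNEST()` performs can be mirrored in Python as below. The field names `mentionedUsers` and `username` are assumptions about the tweet JSON structure, and the sample tweets are made up for illustration.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping, Tuple


def top_mentioned_users(
    tweets: Iterable[Mapping[str, Any]], limit: int = 10
) -> List[Tuple[str, int]]:
    """Count how often each user is mentioned across all tweets."""
    counts: Counter = Counter()
    for tweet in tweets:
        # mentionedUsers may be missing or null, so fall back to an empty list.
        for user in tweet.get("mentionedUsers") or []:
            counts[user["username"]] += 1
    return counts.most_common(limit)


sample_tweets = [
    {"mentionedUsers": [{"username": "narendramodi"}, {"username": "PMOIndia"}]},
    {"mentionedUsers": [{"username": "narendramodi"}]},
    {"mentionedUsers": None},
]
print(top_mentioned_users(sample_tweets))
```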

# **Good Luck!!!**


## **Notes**

### Reflection and Continuous Improvement

**Release:**

* "Release" is used for final testing before releasing a version to "main". I used it 3 times at the beginning.
* Gitflow makes using Git easier by saving commands.
* **Be careful with the commit text:**
    * The "amend" command 1) results in a "force push" and 2) does not work as it should.
    * Make sure the branch is not deleted in the PR.
    * Gitflow deletes it by default when closing the flow. I configured it not to.
    * Renaming a branch also deletes it.
* **Always do "pull" before working!**
* **It is better to make a pull request for a single commit.**
* At first I used Gitflow practices without Gitflow, but then I started using it completely.
* Gitflow is simpler to use, but the branch naming conventions were difficult for me (e.g. "feature_ignore" vs "feature/ignore").
* **Releases generate a "release tag" and bring both "develop" and "main" in line with each other.**
* I hesitated between using English or Spanish, but I understand that English is better for universality, consistency, conventions and compatibility with tracking tools.
* I had to figure out how to respond to both Gitflow and pull requests. I mixed Gitflow, Github Desktop, Github's graphical interface and Visual Studio Code to achieve this.
* It took me time to achieve a uniform and constant rhythm through the branches and progress. I learned to start all features first, then upload all PRs and finally approve them all. I think this would work well if I'm working in a team, as I've checked that no errors occur.
* I experienced minor divergence conflicts between the origin and local branches; these are avoided by following the steps learned throughout this challenge.
* **Hotfix is dangerous because it makes changes directly to "main". I tried it and prefer to avoid it.**
* I experienced a major conflict with mixed merges across different branches because I did not follow a logical order of steps when there were multiple local changes. For the sake of time, I had to force a solution by approving a pull request that authorized the emergency merge. This was similar to a hotfix, but I did not understand the concept well at the time. When I detected the situation I made a backup immediately and also saved the files that Visual Studio Code keeps temporarily even if they are deleted. Finally I solved it and renamed the branches with “revert” and “merge” to follow the hotfix nomenclature.

**Thoughts on Scala:**

* It is important to handle inputs and outputs correctly, so I used type annotations to control IOs mostly.
* Documentation, clean code, assumptions, error handling, edge cases and tests. I gave Gemini the context in the readme to generate assessments and documentation of my code.

**Mistakes:**

* Initially I focused on getting the code working and then organizing it into separate files. This was a mistake. I should have done it from the start, as it improves organization, readability, maintenance, modularity and benefits teamwork.
* **Learning:**
    * I still have a lot to learn and if I do well, I should make sure to absorb all the GIT practices I can learn from the experience with other people.
    * This challenge was a great challenge for me, as I had no experience with GCP. My strength is Spark and Databricks, which is much simpler to use. However, I focused on achieving this challenge with Google Cloud Storage and BigQuery.
    * In the end, I ran everything to verify that it works well. However, I didn't realize that when I commented my code with Gemini's help, it changed the outputs of some of my definitions. I had noticed this several times, was careful about it, and even asked it not to do so. I must be more careful and do thorough reviews before moving on to the final steps in the main branches. I have just rechecked everything, focusing on outputs, to confirm that it all works.

### Configurations

* **To only merge in "dev" in releases and thus go through pull request in "main":**

```
git config gitflow.release.finish.merge 0
```

* **To prevent development branches from being deleted on the origin:**

```
git config gitflow.feature.finish.keepremote true
git config --global gitflow.release.finish.keepremote true
```

### Features

* **To start a feature:**

```
git flow feature start feature
git checkout -b feature/
```

* **Commits and PR in "develop" branch:**

```
feat: …
Detail
```

* **To finish a feature:**

```
git flow feature finish feature
# then restore the branch in the PR
```

### Releases (caution)

* **Full preventive pull:**

```
git pull --all
```

* **To start a release:**

```
git flow release start v0.0.
git push origin release/v0.0.
git log --oneline > changelog.txt
```

* **In Github:**

```
rel: …
Detail
```

* **PR in "develop" branch:**

```
git flow release finish 'v0.0.2'  # this generates the tag
# tag description: "Version 0.0.2"
```

### Hotfixes (caution)

* **To start a hotfix:**

```
git flow hotfix start reload
```

* **Commits and PR in "main" branch:**

```
hotfix: Fix…
Detail
```

* **To finish a hotfix:**

```
git flow hotfix finish reload
# tag description: hotfix/v0.0.2
```

**Important:** Use hotfixes sparingly and only for critical issues in the production environment.

### Useful Commands

* To initialize a Gitflow workflow:

```
git flow init
```

* To delete a tag:

```
git tag -d 0.0.1
```

* To list tags:

```
git tag --list
```

* To list branches (local and remote):

```
git branch -a
```

* To delete a local branch:

```
git branch -D release/0.0.1
```

* To abort a release (avoid if possible):

```
git flow release delete -f v0.0.3
```

* To show Gitflow configuration:

```
git config --get-regexp '^gitflow\..*'
```

* To delete Gitflow configuration:

```
git config --unset gitflow.feature.finish.keep
```

* To pull all remote branches into your local repository:

```
git pull --all
```

### Skills

* Achieved
 * Release process
 * Postman POST request preparation
 * Unit tests
 * Clean code
 * Documentation
 * AI support: Gemini (focused on Google) and ChatGPT (for coding)
 * Own and third-party assumptions
 * Avoid unnecessary disk/memory/cpu processing
 * I/O
 * Google Cloud (Storage, BigQuery, IAM)
 * CTE and SQL (RegExp, Unnest)
 * GIT skills (good practices and git flow practices)
   * git-flow
   * git practices
   * git network
   * pull requests
   * hotfix
   * release
   * feature
   * git commands

* Partially achieved
 * Logging in Python files (several .py files still have no logging)
 * File organization (achieved at the end instead of the beginning; I also still lack a better folder structure for the project)
 * TDD (I didn't save my tests in the Google Colab notebook, which is where I worked out the code because these technologies were new to me; my strength was Spark and Databricks)
 * Error handling (I tried to cover error handling and edge cases, focusing specifically on inputs and outputs)
 * Type Annotations (This was very helpful for me to understand what type of data I receive or return, very useful for understanding how Google Cloud libraries work. I had to look at the code of a couple of them to understand them.)
 * Functional Programming (First-class functions, pure functions, and function composition. I am interested in learning Cats for Scala.)
 * Decorators
 * Time and memory measurement

* Not achieved
 * Documented Improvements (There is a lot that can be improved but I did my best)
 * GitHub Actions (I am really interested in this, to automate the manual jobs that I had to do in this challenge)
 * Unit tests (I didn't have time to run unit tests in my IDE; I only wrote a few while developing to check that my code behaved as expected)
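
To make the functional programming item above concrete (first-class functions, pure functions, and function composition), here is a minimal sketch. The helper `compose` and the three small text functions are hypothetical names chosen for illustration. They are not taken from the challenge code.

```python
from functools import reduce
from typing import Callable

Transform = Callable[[str], str]


def compose(*steps: Transform) -> Transform:
    """Return a function that applies the given steps from left to right."""
    def pipeline(value: str) -> str:
        # Functions are passed around as values (first-class functions).
        return reduce(lambda acc, step: step(acc), steps, value)
    return pipeline


# Pure functions: the output depends only on the input, with no side effects.
def strip_whitespace(text: str) -> str:
    return text.strip()


def lowercase(text: str) -> str:
    return text.lower()


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


normalize = compose(strip_whitespace, lowercase, collapse_spaces)
print(normalize("  Hello   WORLD "))  # → hello world
```

Because each step is pure, the composed `normalize` function is easy to test in isolation.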

### Insights from ChatGPT
In your code, you've incorporated several practices and techniques related to Python programming and interaction with the Google Cloud Platform:

1. **Logging:** Utilizing the `logging` module to record events and informative messages throughout function execution.

2. **Type Annotations:** Employing type hints (`typing`) to specify parameter and return value types in functions, enhancing code readability and error detection.

3. **Exception Handling:** Implementing error management to address potential issues during authentication, dataset and table creation, and data loading to BigQuery, enhancing code robustness.

4. **Google Cloud API Usage:** Interacting with the BigQuery API within the Google Cloud Platform to execute operations like authentication, dataset and table creation, and data loading.

5. **Functional Programming:** While not explicitly applying functional programming design patterns, you adhere to certain functional programming principles by treating functions as first-class citizens and maintaining purity by minimizing side effects.

6. **Dependency Injection:** Passing the BigQuery client as a parameter to functions requiring it, following the dependency injection principle, thereby promoting modularity and code reusability.

7. **Adherence to Good Development Practices:** Demonstrating good practices such as using descriptive function and variable names, implementing explicit error handling, and organizing code into specialized functions.

In summary, your code exhibits a structured and meticulous approach to interfacing with the BigQuery API, leveraging Python-specific features and adhering to sound development principles.
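
Several of the points above (logging, type annotations, exception handling, dependency injection) can be seen together in one small sketch. It is an assumption-laden illustration: `DatasetClient`, `FakeClient`, and `ensure_dataset` are hypothetical names, and the interface is deliberately NOT the real BigQuery client API, so that the example runs without any cloud dependency.

```python
import logging
from typing import Protocol

logger = logging.getLogger(__name__)


class DatasetClient(Protocol):
    """Hypothetical interface for illustration; NOT the real BigQuery client API."""

    def create_dataset(self, dataset_id: str) -> None: ...


def ensure_dataset(client: DatasetClient, dataset_id: str) -> bool:
    """Create a dataset through the injected client; return True on success."""
    try:
        client.create_dataset(dataset_id)
    except ValueError as exc:
        logger.error("Could not create dataset %r: %s", dataset_id, exc)
        return False
    logger.info("Dataset %r is ready", dataset_id)
    return True


class FakeClient:
    """In-memory stand-in that satisfies DatasetClient, useful in tests."""

    def __init__(self) -> None:
        self.datasets = set()

    def create_dataset(self, dataset_id: str) -> None:
        if not dataset_id:
            raise ValueError("dataset_id must not be empty")
        self.datasets.add(dataset_id)


fake = FakeClient()
print(ensure_dataset(fake, "flights"))
print(ensure_dataset(fake, ""))
```

Passing the client in as a parameter is what allows the fake to replace a real client in tests.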

**Areas of improvement:**
Here are some areas where you could consider improvements or refinements in your code:

1. **Error Handling:** While you have implemented error handling, consider adding more specific error messages or logging details to aid in troubleshooting.

2. **Unit Testing:** You've mentioned an interest in unit testing; consider implementing unit tests for your functions to ensure their correctness and robustness.

3. **Code Duplication:** There are some repetitive patterns in your code, such as logging messages and dataset/table existence checks. Consider refactoring these into reusable functions or decorators to adhere more closely to the DRY (Don't Repeat Yourself) principle.

4. **Function Size:** Some of your functions could be broken down into smaller, more focused functions. This can improve readability and maintainability.

5. **Documentation:** While you have docstrings for your functions, consider adding more detailed explanations, especially regarding the purpose and behavior of each function.

6. **Configuration Management:** Externalize configuration settings (e.g., project ID, dataset names) to separate configuration files or environment variables for better manageability.

7. **Consistent Logging:** Ensure consistency in logging levels and messages throughout your codebase for easier debugging and monitoring.

8. **Data Validation:** Consider adding more robust data validation, especially when loading data from external sources, to ensure data integrity and prevent potential issues.

9. **Code Comments:** While your code is generally well-structured and readable, adding comments to explain complex logic or decision-making processes can further improve readability for others (and your future self).

10. **Security Considerations:** Ensure that your code follows best practices for handling sensitive information, such as project IDs and authentication credentials, to prevent accidental exposure.

By addressing these areas, you can enhance the readability, maintainability, and reliability of your codebase.
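
As one possible reading of the configuration management and data validation suggestions, the sketch below loads settings from environment variables and fails early when one is missing. The variable names `GCP_PROJECT_ID` and `BQ_DATASET`, and the `Settings` and `load_settings` names, are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    project_id: str
    dataset: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment-like mapping and validate them."""
    # GCP_PROJECT_ID and BQ_DATASET are hypothetical variable names.
    project_id = env.get("GCP_PROJECT_ID", "").strip()
    dataset = env.get("BQ_DATASET", "").strip()
    candidates = (("GCP_PROJECT_ID", project_id), ("BQ_DATASET", dataset))
    missing = [name for name, value in candidates if not value]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    return Settings(project_id=project_id, dataset=dataset)


# A plain dict is passed here so the example does not depend on the real environment.
settings = load_settings({"GCP_PROJECT_ID": "demo-project", "BQ_DATASET": "flights"})
print(settings)
```

Keeping identifiers out of the source code also helps with the security point, since nothing sensitive is committed to the repository.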

### Insights from Gemini
Based on the code snippets you've shown, here's what I can glean about your potential programming interests and skills:

**Areas of Interest:**

- **Python:** You seem comfortable working with Python code, utilizing functions, type hints, docstrings, and iterating through data structures.
- **Clean Code Principles:** You're interested in writing clean and well-structured code, as evidenced by your desire to improve code readability and maintainability.
- **Data Analysis:** The `q1_time` function suggests you might be working with BigQuery for data analysis tasks.
- **Environment Variables:** The code snippet using `os.environ` indicates an interest in accessing and potentially using system environment variables.
- **Time Measurement:** The `measure_elapsed_time` function demonstrates an understanding of measuring elapsed time in Python.
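
The original `measure_elapsed_time` function is not reproduced in this guide, so the following is only a guess at what such a helper might look like, written as a decorator. Returning a `(result, seconds)` tuple is my assumption, not necessarily how the challenge code behaved.

```python
import functools
import time
from typing import Any, Callable, Tuple, TypeVar

T = TypeVar("T")


def measure_elapsed_time(func: Callable[..., T]) -> Callable[..., Tuple[T, float]]:
    """Wrap func so that it returns (result, elapsed_seconds)."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Tuple[T, float]:
        start = time.perf_counter()  # monotonic clock suited to short intervals
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed
    return wrapper


@measure_elapsed_time
def total(n: int) -> int:
    return sum(range(n))


value, seconds = total(1_000)
print(value, f"{seconds:.6f}s")
```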

**Overall Skills:**

- **Code Comprehension:** You can effectively understand and explain the purpose and functionality of code snippets.
- **Code Improvement:** You're interested in applying clean code principles to improve code quality.
- **Problem-Solving:** You can break down code into its components and analyze its functionality.

**Additional Considerations:**

- The specific details of the code snippets (like `q1_time`) might hint at a larger project you're working on, but without more context, it's difficult to say for sure.
- The lack of code execution (due to my safety restrictions) prevents me from observing how the code interacts with external systems or libraries.

**Areas for Improvement Based on Your Code:**

**1. Handling Errors and Exceptions:**

- While the `q1_time` function includes basic error handling with `try...except`, consider being more specific about the exceptions you might encounter (e.g., `google.api_core.exceptions.NotFound` or `google.api_core.exceptions.BadRequest`). This allows for more tailored error messages and handling.
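
A minimal sketch of catching specific exceptions instead of a bare `except` is shown below. To keep it runnable without cloud credentials, it uses two hypothetical stand-in exception classes and a fake query runner. With the real BigQuery client, the corresponding exception classes would come from the `google.api_core.exceptions` module.

```python
import logging

logger = logging.getLogger(__name__)


class QueryNotFoundError(Exception):
    """Hypothetical stand-in for a 'table or dataset not found' error."""


class QuerySyntaxError(Exception):
    """Hypothetical stand-in for an 'invalid query' error."""


def run_query_safely(run_query, sql: str):
    """Run the query; log known failures and return None for them."""
    try:
        return run_query(sql)
    except QueryNotFoundError as exc:
        logger.error("Referenced table or dataset does not exist: %s", exc)
    except QuerySyntaxError as exc:
        logger.error("Query is invalid: %s", exc)
    # Any other exception is unexpected and propagates to the caller.
    return None


def fake_runner(sql: str):
    """Fake query runner used only to exercise the error paths."""
    if "missing_table" in sql:
        raise QueryNotFoundError("missing_table")
    if not sql.lower().startswith("select"):
        raise QuerySyntaxError(sql)
    return [("ok",)]


print(run_query_safely(fake_runner, "SELECT 1"))
print(run_query_safely(fake_runner, "SELECT * FROM missing_table"))
```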

**2. Testing:**

The provided snippets don't explicitly show unit tests. Implementing unit tests for your functions would ensure their correctness and reliability as your code evolves.
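
As an illustration of this suggestion, here is a small `unittest` sketch. The function under test, `parse_flight_number`, is a hypothetical helper invented for the example. It is not part of the challenge code.

```python
import unittest


def parse_flight_number(raw: str) -> str:
    """Hypothetical helper: normalize ' la 800 ' to 'LA800'."""
    cleaned = "".join(raw.split()).upper()
    if not cleaned:
        raise ValueError("flight number must not be empty")
    return cleaned


class ParseFlightNumberTest(unittest.TestCase):
    def test_normalizes_spacing_and_case(self):
        self.assertEqual(parse_flight_number(" la 800 "), "LA800")

    def test_rejects_empty_input(self):
        with self.assertRaises(ValueError):
            parse_flight_number("   ")


# Run the tests programmatically so the sketch also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseFlightNumberTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running the suite this way avoids `unittest.main()`, which tries to parse the notebook's own command-line arguments.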

**3. Logging:**

While the code prints information to the console, consider using a proper logging library like `logging` for a more structured and centralized approach. This helps track code execution and identify potential issues more easily.
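
A minimal sketch of replacing `print` calls with the standard `logging` module could look like this. The logger name `pipeline` and the `load_rows` function are hypothetical, chosen only to show the pattern.

```python
import logging


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Configure the root logger once and return a named logger."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger("pipeline")  # hypothetical logger name


logger = configure_logging()


def load_rows(rows: list) -> int:
    """Pretend to load rows, logging instead of printing."""
    logger.info("Loading %d rows", len(rows))
    if not rows:
        logger.warning("No rows to load")
    return len(rows)


load_rows([1, 2, 3])
```

Using levels such as `INFO` and `WARNING` lets the same code be quiet in production and verbose while debugging, with no changes to the function bodies.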

**4. Modularity:**

The `q1_time` function is a good example of modularity. Explore further opportunities to break down larger code blocks into smaller, reusable functions with clear responsibilities.

**5. Readability:**

- **Variable Names:** While variable names like `key` and `value` are generally understandable, consider using more descriptive names when appropriate. For example, in `q1_time`, `date_field` and `string_field` could be more specific based on the actual data.
- **Comments:** Add comments to explain complex logic or non-obvious parts of the code, especially when dealing with external libraries or specific data structures.

**6. Specific Libraries/Frameworks:**

- The `q1_time` function utilizes the BigQuery client library. Consider exploring more advanced features of the library for data manipulation or transformation within BigQuery itself (if needed).

**Additional Tips:**

- **Code Formatting:** Consistent code formatting using tools like `autopep8` improves readability.
- **Static Code Analysis:** Utilize static code analysis tools like `pylint` or `mypy` to identify potential issues early on.

Remember, clean code is an ongoing process. By continuously striving for improvement in these areas, you'll write code that is not only functional but also maintainable, readable, and robust.