## LT Challenge Solution ## 

### Objectives
1. The main goal of this solution is to explore memory usage and optimization in a distributed, scalable data processing environment such as Spark.
2. Explore scalable cloud solutions such as Cloud Storage, Cloud Build, and Cloud Run on GCP.
3. Deploy a reproducible Docker environment ready to run Jupyter and PySpark.
4. Deploy an automated Cloud Build CI for my Docker image to GCP Artifact Registry.
5. Adopt and adhere to a Git flow that keeps feature, build, and testing tasks organized.
6. Implement test-driven development (TDD) to address the challenge questions.
7. Implement data transformations to define data quality layers.
8. Explore analytical data transformation techniques to address the challenge questions.

## LT Data Analysis Challenge - step 1 --> q1_memory.py

### Description
The script performs the following operations:

1. Spark initialization: Create a Spark session to process the data.
2. JSON file reading: Read data from the specified JSON file using a defined schema that includes fields such as identifier, username, message, and date.
3. Data analysis:
     - Group data by date and count messages to identify the 10 dates with the most activity.
     - Filter the data to only keep messages from these dates.
     - Count each user's messages on these dates to determine the most active user per day.
4. Results: The script returns a list of tuples, each containing a date and the name of the most active user on that date.
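
The analysis steps above can be sketched in plain Python. This is not the Spark implementation in `q1_memory.py`; it is a dependency-free illustration of the same logic. It assumes each message is reduced to a `(date, username)` pair, and the function name `most_active_user_per_top_date` is hypothetical.

```python
from collections import Counter


def most_active_user_per_top_date(records, top_n=10):
    """Return [(date, username)] for the top_n busiest dates.

    records: iterable of (date, username) pairs, one per message
    (an assumed, simplified shape of the input data).
    """
    records = list(records)

    # Count messages per date and keep the top_n dates with the most activity.
    per_date = Counter(date for date, _ in records)
    top_dates = [date for date, _ in per_date.most_common(top_n)]

    # Count messages per (date, user), restricted to those dates.
    keep = set(top_dates)
    per_user = Counter((d, u) for d, u in records if d in keep)

    # For each top date, pick the user with the most messages.
    result = []
    for date in top_dates:
        candidates = {u: c for (d, u), c in per_user.items() if d == date}
        # Ties are broken by username so the output is deterministic.
        best = min(candidates, key=lambda u: (-candidates[u], u))
        result.append((date, best))
    return result
```

The return value has the same shape as the script's result: a list of tuples, each with a date and the most active user on that date.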

### Execution Steps
To execute this Spark application, follow these steps:
1. Install the necessary Python packages (`requirements.txt`)
2. The data file `farmers-protest-tweets-2021-2-4.json` is in the `data/raw` folder, compressed as a `.rar` archive. Decompress it before running the code.
3. Run the Spark application, passing the path to the JSON file as an argument.

### Review the results:
The script will print the results to the console, showing the dates with the most messages and the most active user on those dates.
The memory usage analysis results will also be displayed in the console if `memory_profiler` is active.
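
For a quick check without extra dependencies, peak memory can also be measured with the standard-library `tracemalloc` module. This is a stand-in for illustration, not the `memory_profiler` setup the script uses, and it only tracks allocations made by the Python interpreter (not memory used by the Spark JVM). The names `measure_peak_memory` and `build_list` are hypothetical.

```python
import tracemalloc


def measure_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak_bytes) as traced by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_list(n):
    # Hypothetical workload standing in for the analysis function.
    return [i * 2 for i in range(n)]


result, peak = measure_peak_memory(build_list, 100_000)
print(f"peak traced memory: {peak / 1024:.1f} KiB")
```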

## LT Data Analysis Challenge - step 2 --> q1_time.py

### Description
The script performs the following operations:

1. Initialize a Spark session and read data from the specified file.
2. Persist the DataFrame in memory and on disk to optimize performance.
3. Group the data by creation date and count the posts for each date.
4. Keep the 10 dates with the most posts and find the most active users on those dates.
5. Use a window function to rank users by activity and select the most active user for each date.
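
The window-function step can be illustrated in plain Python. The sketch below mimics the effect of a `ROW_NUMBER()` ranking partitioned by date and ordered by descending post count, then keeps row number 1 per partition. It is not the PySpark code in `q1_time.py`; the input shape `(date, username, n_posts)` and the function name `top_user_per_date` are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def top_user_per_date(counts):
    """Return [(date, username)] with the most active user per date.

    counts: iterable of (date, username, n_posts) rows, already aggregated
    (an assumed shape for illustration).
    """
    # Sort by the partition key (date), then by descending post count.
    # The username is a final tie-breaker so the result is deterministic.
    ordered = sorted(counts, key=lambda r: (r[0], -r[2], r[1]))

    winners = []
    for date, rows in groupby(ordered, key=itemgetter(0)):
        first = next(rows)  # the row ranked first within this date
        winners.append((date, first[1]))
    return winners
```

Sorting once and taking the first row of each group avoids a separate maximum search per date, which is the same idea that makes the window-based ranking convenient in Spark.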

### Execution Steps
To execute this Spark application, follow these steps:
1. Install the necessary Python packages (`requirements.txt`)
2. The data file `farmers-protest-tweets-2021-2-4.json` is in the `data/raw` folder, compressed as a `.rar` archive. Decompress it before running the code.
3. Run the script from the command line, passing the JSON file path as an argument:
```bash 
python q1_time.py /data/raw/farmers-protest-tweets-2021-2-4.json
```

### Review the results:
The script will print the results to the console, showing the 10 dates with the most posts and the most active user on each of those dates.

## LT Data Analysis Challenge - step 3 --> q2_memory.py

### Description
The script performs the following operations:

1. Initialize a Spark session and read tweet data from the JSON file.
2. Extract emojis from the tweet content using a user-defined function (UDF).
3. Count the occurrences of each emoji.
4. Sort the emojis by frequency, with alphabetical order for tiebreakers.
5. Return the 10 most used emojis along with their counts.
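
The counting and tie-breaking logic above can be sketched in plain Python. This is an illustration only, not the UDF used in the script. It assumes a simplified regular expression covering common emoji code point ranges, so multi-code-point emojis (flags, skin-tone variants) are counted per code point, and "alphabetical order" is taken to mean code point order. The names `EMOJI_PATTERN` and `top_emojis` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a simplified character class for common emoji ranges.
# A dedicated emoji library would be more complete.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def top_emojis(texts, top_n=10):
    """Return the top_n (emoji, count) pairs across all texts."""
    counts = Counter()
    for text in texts:
        # Treat missing content (None) as an empty string.
        counts.update(EMOJI_PATTERN.findall(text or ""))

    # Sort by descending frequency, then by code point order for ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```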

### Execution Steps
To execute this Spark application, follow these steps:
1. Install the necessary Python packages (`requirements.txt`)
2. The data file `farmers-protest-tweets-2021-2-4.json` is in the `data/raw` folder, compressed as a `.rar` archive. Decompress it before running the code.
3. Run the script from the command line, passing the JSON file path as an argument:
```bash 
python q2_memory.py /data/raw/farmers-protest-tweets-2021-2-4.json
```

### Review the results:
The script will print the results to the console, displaying the 10 most used emojis along with their counts.
The memory usage analysis results will also be displayed in the console if `memory_profiler` is active.