In this file you can write whatever you see fit. We recommend detailing your solution and all the assumptions you are making. Here you can run the functions you defined in the other files of the src folder, measure time, memory, etc.

In [13]:
file_path = "../../data/farmers-protest-tweets-2021-2-4.json"

## Testing

I wrote a simple set of tests using pytest to make sure that the outputs of the functions match.
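The contents of tests.py are not shown in this notebook, so the following is only a minimal sketch of what such a check could look like. It assumes that "match" means the time-optimized and memory-optimized variants of a question return the same result. The two functions and the sample data are hypothetical stand-ins, not the real q1/q2/q3 implementations.

```python
import json
from collections import Counter
from typing import List, Tuple

# Hypothetical sample input: one JSON record per line, like the tweets file.
SAMPLE_LINES = [
    '{"user": "a"}',
    '{"user": "b"}',
    '{"user": "a"}',
]


def count_users_eager(lines: List[str]) -> List[Tuple[str, int]]:
    """Stand-in for a time-oriented variant: parse everything, then count."""
    records = [json.loads(line) for line in lines]
    return Counter(record["user"] for record in records).most_common(10)


def count_users_streaming(lines: List[str]) -> List[Tuple[str, int]]:
    """Stand-in for a memory-oriented variant: count one record at a time."""
    counter: Counter = Counter()
    for line in lines:
        counter[json.loads(line)["user"]] += 1
    return counter.most_common(10)


def test_outputs_match() -> None:
    # pytest collects any function named test_*; a plain assert is enough.
    assert count_users_eager(SAMPLE_LINES) == count_users_streaming(SAMPLE_LINES)
```

Running `pytest` on a file containing this code would collect and run `test_outputs_match`.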

In [16]:
!pytest tests.py

platform darwin -- Python 3.11.6, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/bjuanm/Desktop/Interviews/LATAM/tweets-analysis/challenge_DE/src
collected 3 items

tests.py ...                                                             [100%]



## **Outputs** -- With memory profiling and time tracking

For exercise one, both versions are the same, as explained in the q1_time script. For the remaining functions, the memory profiler output and the time tracker show that the memory-optimized functions take longer to run but use less memory.
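To make the time versus memory trade-off concrete, here is a small sketch that measures both for a single call. It uses the standard library (`time.perf_counter` and `tracemalloc`) instead of the memory profiler used in this notebook, so the numbers are not directly comparable: `tracemalloc` only tracks Python-level allocations, not the process memory reported in the "Mem usage" column. The `measure`, `sum_eager` and `sum_streaming` names are hypothetical and exist only for illustration.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(func: Callable[..., Any], *args: Any) -> Tuple[Any, float, int]:
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def sum_eager(n: int) -> int:
    # Materializes every value in a list before summing (more memory).
    return sum(list(range(n)))


def sum_streaming(n: int) -> int:
    # Consumes one value at a time (less memory).
    return sum(range(n))


if __name__ == "__main__":
    for func in (sum_eager, sum_streaming):
        _, elapsed, peak = measure(func, 100_000)
        print(f"{func.__name__}: {elapsed:.4f} s, peak {peak / 1024:.1f} KiB")
```

In this toy case the streaming version is not slower; it only illustrates how the peak memory differs between materializing the data and consuming it incrementally.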

In [2]:
print("Q1 Time")
!python3 q1_time.py

Q1 Time
Filename: /Users/bjuanm/Desktop/Interviews/LATAM/tweets-analysis/challenge_DE/src/q1_time.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     8     48.0 MiB     48.0 MiB           1   @profile
     9                                         def q1_time(file_path: str) -> List[Tuple[datetime.date, str]]:
    10                                         
    11                                             """
    12                                             I tried parallelizing this function as well so that it would be time optimized (like the following ones),
    13                                             But I think I wasn't being able to do  it properly as running parallel functions and merging the results
    14                                             was taking longer than just do it sequentially. Perhaps because of the double complexity of finding 
    15                                             the most common date and also retrieving the user 

In [3]:
print("Q1 Memory")
!python3 q1_memory.py

Q1 Memory
Filename: /Users/bjuanm/Desktop/Interviews/LATAM/tweets-analysis/challenge_DE/src/q1_memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     8     49.0 MiB     49.0 MiB           1   @profile
     9                                         def q1_memory(file_path: str) -> List[Tuple[datetime.date, str]]:
    10                                         
    11                                             # Using a defaultdict we will have constant-time complexity when we insert values and perform lookups
    12     49.7 MiB   -419.5 MiB          27       agg_dict = defaultdict(lambda: defaultdict(int))
    13                                         
    14                                             # try-catch in case the file doesn't exist or the path is wrong
    15     49.0 MiB      0.0 MiB           1       try:
    16                                                 # Open json file with tweets data to loop through every record
    17     49.0 MiB    -3

In [4]:
print("Q2 Time")
!python3 q2_time.py

Q2 Time


Filename: /Users/bjuanm/Desktop/Interviews/LATAM/tweets-analysis/challenge_DE/src/q2_time.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    27     65.7 MiB     65.7 MiB           1   @profile
    28                                         def q2_time(file_path: str) -> List[Tuple[str, int]]:
    29                                         
    30                                             # We will update this counter with the output of each pool so we can aggregate at the end
    31     65.7 MiB      0.0 MiB           1       emojis_counter = Counter()
    32                                         
    33                                             # try-catch in case the file doesn't exist or the path is wrong
    34     65.7 MiB      0.0 MiB           1       try:
    35                                                 # Open json file with tweets data to loop through every record
    36    186.9 MiB      0.0 MiB           2           with open(file_path, 'r') as

In [5]:
print("Q2 Memory")  # This one takes a long time to run, so please be patient if you run this cell
!python3 q2_memory.py

Q2 Memory
Filename: /Users/bjuanm/Desktop/Interviews/LATAM/tweets-analysis/challenge_DE/src/q2_memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     8     64.9 MiB     64.9 MiB           1   @profile
     9                                         def q2_memory(file_path: str) -> List[Tuple[str, int]]:
    10                                             
    11                                             # Counter objects provide time optimized methods like the one we will be using to get the most popular emojis
    12     64.9 MiB      0.0 MiB           1       emojis_counter = Counter()
    13                                         
    14                                             # try-catch in case the file doesn't exist or the path is wrong
    15     64.9 MiB      0.0 MiB           1       try:
    16                                                 # Open json file with tweets data to loop through every record
    17     64.9 MiB    -47.4 MiB           2 

In [6]:
print("Q3 Time")
!python3 q3_time.py

Q3 Time
Filename: /Users/bjuanm/Desktop/Interviews/LATAM/tweets-analysis/challenge_DE/src/q3_time.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    42     48.3 MiB     48.3 MiB           1   @profile
    43                                         def q3_time(file_path: str) -> List[Tuple[str, int]]:
    44                                         
    45    195.3 MiB      0.0 MiB           2       with open(file_path, 'r') as file:
    46                                         
    47                                                 # Create a pool of processes to run in parallel
    48    195.3 MiB      0.7 MiB           2           with Pool() as pool:
    49                                         
    50                                                     # Apply the get_mentions function with file path argument
    51     49.1 MiB      0.0 MiB           1               process_line = partial(get_mentions)
    52                                         
    53   

In [7]:
print("Q3 Memory")
!python3 q3_memory.py

Q3 Memory
Filename: /Users/bjuanm/Desktop/Interviews/LATAM/tweets-analysis/challenge_DE/src/q3_memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     7     48.9 MiB     48.9 MiB           1   @profile
     8                                         def q3_memory(file_path: str) -> List[Tuple[str, int]]:
     9                                             
    10     48.9 MiB      0.0 MiB           1       mentions_counter = Counter()
    11                                         
    12                                             # I found out that there is a mentionedUsers key with the tweet mentions so I will use that instead
    13                                             # of the regular expression that I used to have
    14     50.9 MiB      0.0 MiB           2       with open(file_path, 'r') as file:
    15     50.9 MiB      0.5 MiB      117408           for line in file:
    16     50.9 MiB      1.1 MiB      117407               tweet = json.loads(line)
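The profiler listing above is cut off, so the rest of q3_memory is not visible. As a rough sketch of the line-by-line approach it starts with, the function below reads one tweet at a time and updates a `Counter`. It assumes that `mentionedUsers` is either null or a list of objects with a `username` field; that structure, the `count_mentions` name and the `top_n` parameter are assumptions and are not confirmed by the output shown.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def count_mentions(lines: Iterable[str], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count mentioned usernames while holding only one tweet in memory."""
    mentions_counter: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Assumption: the key may be missing or null when nobody is mentioned.
        for user in tweet.get("mentionedUsers") or []:
            mentions_counter[user["username"]] += 1
    return mentions_counter.most_common(top_n)


# Hypothetical sample records, one JSON object per line.
sample = [
    '{"mentionedUsers": [{"username": "alice"}, {"username": "bob"}]}',
    '{"mentionedUsers": null}',
    '{"mentionedUsers": [{"username": "alice"}]}',
]
print(count_mentions(sample))
```

Because a file object is itself an iterable of lines, the same function could be called as `count_mentions(file)` inside a `with open(file_path, 'r') as file:` block.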
