### LATAM DE Challenge
##### Author: Carlos Mendoza R.
Friday, March 22, 2024.

##### Introduction
This challenge proposes solutions to three problems, each with one variant optimized for memory use and one optimized for execution time. The input is a non-relational dataset in JSON format: 348 MB of data containing 117,407 records, each representing a tweet about the farmers' protest. The repository that processes this data has the following structure:
```
├── README.md
├── data
│   └── farmers-protest-tweets-2021-2-4.json
├── requirements.txt
├── set-up.sh
└── src
    ├── challenge.ipynb
    ├── q1_memory.py
    ├── q1_time.py
    ├── q2_memory.py
    ├── q2_time.py
    ├── q3_memory.py
    ├── q3_time.py
    └── utils.py
```

A bash script, `set-up.sh`, prepares the Python environment and creates the additional directory structure beyond what the repository on the git platform provides.



In general, the memory and time optimizations follow the structures described below.

#### Memory optimization

The JSON file is read line by line. As each line is read, the method or methods that extract the information needed to solve the problem are applied to that line. In pseudocode, the structure is as follows.

```
def read_file(file_path):
    key_information = []
    with open(file_path) as lines:
        for line in lines:
            # keep only the information needed from the current line
            key_information += method_to_get_key_information(line)
    return method_to_get_final_solution(key_information)
```

The goal is to keep only the current line 'active' in memory rather than the whole file, which reduces the space used. However, in almost all cases the execution time increases, because the methods are executed many times.
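
As a concrete illustration of the line-by-line idea, here is a minimal sketch that counts the values of one field while holding only a single parsed record at a time. The function name `count_field_streaming` and the flat `user` field are assumptions made for this example; they are not the repository's actual helpers or the dataset's real schema.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def count_field_streaming(
    lines: Iterable[str], field: str, top_n: int = 10
) -> List[Tuple[str, int]]:
    """Count the values of `field`, parsing one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting
        value = record.get(field)
        if value is not None:
            counts[value] += 1
    return counts.most_common(top_n)


# Hypothetical sample data; a real run would pass an open file handle instead.
sample = [
    '{"user": "a"}',
    '{"user": "b"}',
    '{"user": "a"}',
]
top_users = count_field_streaming(sample, "user")
print(top_users)
```

Because an open file object is itself an iterable of lines, the same function could be called as `count_field_streaming(open(path), "user")` without ever materializing the whole file.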

#### Time optimization

The whole file is loaded first, and then separate methods are run to obtain the values needed for the solution; each method takes the full data set and processes it. The pseudocode is as follows.

```
def read_file(file_path):
    ...  # logic to read the whole file

def method_to_get_solution_1(data):
    ...  # logic 1

def method_to_get_solution_2(data):
    ...  # logic 2

data = read_file(file_path)
key_info1 = method_to_get_solution_1(data)
solution = method_to_get_solution_2(data)
This optimizes processing time. In addition, each method uses a more efficient way of obtaining its solution.
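
The load-everything-first structure can be sketched with the standard library alone. The sketch below is an assumption-laden illustration: the helper names (`load_records`, `top_dates`, `top_user_on`) and the flat `date` and `user` fields are hypothetical, and the repository's own time-optimized code appears to use a DataFrame (`json_to_df`) rather than plain lists.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def load_records(lines: Iterable[str]) -> List[Dict]:
    """Parse every line up front so later steps work on in-memory data."""
    return [json.loads(line) for line in lines if line.strip()]


def top_dates(records: List[Dict], n: int) -> List[str]:
    """Return the n dates with the most records."""
    counts = Counter(record["date"] for record in records)
    return [date for date, _ in counts.most_common(n)]


def top_user_on(records: List[Dict], date: str) -> str:
    """Return the user with the most records on the given date."""
    users = Counter(r["user"] for r in records if r["date"] == date)
    return users.most_common(1)[0][0]


# Hypothetical sample data standing in for the real file.
sample = [
    '{"date": "2021-02-12", "user": "a"}',
    '{"date": "2021-02-12", "user": "a"}',
    '{"date": "2021-02-12", "user": "b"}',
    '{"date": "2021-02-13", "user": "c"}',
]
records = load_records(sample)
result = [(date, top_user_on(records, date)) for date in top_dates(records, 2)]
print(result)
```

Each step reuses the already-parsed `records`, so the file is read and decoded only once, at the cost of keeping every record in memory.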

### Code tests
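
The outputs in this section show each function decorated with `@timer` and `@profile`, and include lines such as `Function 'q1_memory' executed in 5.490770 seconds`. The repository's own `timer` is not reproduced here; the following is only a sketch of a decorator that would print a line in that format, assuming it measures wall-clock time around the call.

```python
import functools
import time


def timer(func):
    """Print how long each call to `func` takes, then return its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Function '{func.__name__}' executed in {elapsed:.6f} seconds")
        return result
    return wrapper


@timer
def add(a, b):
    # Stand-in for one of the challenge functions.
    return a + b


total = add(2, 3)
```

`functools.wraps` keeps the original function name, which is what lets the printed message refer to the wrapped function rather than to `wrapper`.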

In [16]:
from q1_memory import q1_memory

file_path = "../data/farmers-protest-tweets-2021-2-4.json"
q1_memory(file_path)

# Function 'q1_memory' executed in 5.490770 seconds
# [(datetime.date(2021, 2, 12), 'RanbirS00614606'),
#  (datetime.date(2021, 2, 13), 'MaanDee08215437'),
#  (datetime.date(2021, 2, 17), 'RaaJVinderkaur'),
#  (datetime.date(2021, 2, 16), 'jot__b'),
#  (datetime.date(2021, 2, 14), 'rebelpacifist'), 
#  (datetime.date(2021, 2, 18), 'neetuanjle_nitu'),
#  (datetime.date(2021, 2, 15), 'jot__b'), 
#  (datetime.date(2021, 2, 20), 'MangalJ23056160'),
#  (datetime.date(2021, 2, 23), 'Surrypuria'),
#  (datetime.date(2021, 2, 19), 'Preetm91')]


## RESULTS
# Line #    Mem usage    Increment  Occurrences   Line Contents
# =============================================================
#     44     43.1 MiB     43.1 MiB           1   @timer
#     45                                         @profile
#     46                                         def q1_memory(file_path: str) -> List[Tuple[datetime.date, str]]:
#     47     41.3 MiB     -1.9 MiB           1       dates_data = count_dates(file_path)
#     48     41.8 MiB      0.5 MiB           1       result = get_top_users_per_date(dates_data)
#     49     41.8 MiB      0.0 MiB           1       return result


Filename: /Users/Carlos/Documents/Scripts/Desafios/latam_de/src/q1_memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    44     26.4 MiB     26.4 MiB           1   @timer
    45                                         @profile
    46                                         def q1_memory(file_path: str) -> List[Tuple[datetime.date, str]]:
    47     35.8 MiB      9.4 MiB           1       dates_data = count_dates(file_path)
    48     35.9 MiB      0.0 MiB           1       result = get_top_users_per_date(dates_data)
    49     35.9 MiB      0.0 MiB           1       return result


Function 'q1_memory' executed in 5.793476 seconds


[(datetime.date(2021, 2, 12), 'RanbirS00614606'),
 (datetime.date(2021, 2, 13), 'MaanDee08215437'),
 (datetime.date(2021, 2, 17), 'RaaJVinderkaur'),
 (datetime.date(2021, 2, 16), 'jot__b'),
 (datetime.date(2021, 2, 14), 'rebelpacifist'),
 (datetime.date(2021, 2, 18), 'neetuanjle_nitu'),
 (datetime.date(2021, 2, 15), 'jot__b'),
 (datetime.date(2021, 2, 20), 'MangalJ23056160'),
 (datetime.date(2021, 2, 23), 'Surrypuria'),
 (datetime.date(2021, 2, 19), 'Preetm91')]

In [17]:
from q1_time import q1_time

q1_time(file_path)

# [(datetime.date(2021, 2, 12), 'RanbirS00614606'),
#  (datetime.date(2021, 2, 13), 'MaanDee08215437'),
#  (datetime.date(2021, 2, 17), 'RaaJVinderkaur'),
#  (datetime.date(2021, 2, 16), 'jot__b'),
#  (datetime.date(2021, 2, 14), 'rebelpacifist'),
#  (datetime.date(2021, 2, 18), 'neetuanjle_nitu'),
#  (datetime.date(2021, 2, 15), 'jot__b'),
#  (datetime.date(2021, 2, 20), 'MangalJ23056160'),
#  (datetime.date(2021, 2, 23), 'Surrypuria'),
#  (datetime.date(2021, 2, 19), 'Preetm91')]

# Line #    Mem usage    Increment  Occurrences   Line Contents
# =============================================================
#     47     80.0 MiB     80.0 MiB           1   @timer
#     48                                         @profile
#     49                                         def q1_time(file_path: str) -> List[Tuple[datetime.date, str]]:
#     50     80.0 MiB      0.0 MiB           1       result = list()
#     51    142.1 MiB     62.0 MiB           1       df = json_to_df(file_path)
#     52    142.1 MiB      0.0 MiB           1       date_df = df['date']
#     53    142.5 MiB      0.4 MiB           1       top_10_dates = get_top_10_dates(date_df)
#     54    143.7 MiB     -9.7 MiB          11       for date in top_10_dates:
#     55    143.7 MiB     -8.5 MiB          10           user = get_tweeter_user_per_day(date, df)
#     56    143.7 MiB     -9.7 MiB          10           result.append((date, user))
#     57    142.7 MiB     -1.0 MiB           1       return result


# Function 'q1_time' executed in 6.556365 seconds


Filename: /Users/Carlos/Documents/Scripts/Desafios/latam_de/src/q1_time.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    47     37.4 MiB     37.4 MiB           1   @timer
    48                                         @profile
    49                                         def q1_time(file_path: str) -> List[Tuple[datetime.date, str]]:
    50     37.4 MiB      0.0 MiB           1       result = list()
    51     92.4 MiB     55.0 MiB           1       df = json_to_df(file_path)
    52     92.5 MiB      0.0 MiB           1       date_df = df['date']
    53     93.7 MiB      1.2 MiB           1       top_10_dates = get_top_10_dates(date_df)
    54     97.7 MiB      0.0 MiB          11       for date in top_10_dates:
    55     97.7 MiB      4.0 MiB          10           user = get_tweeter_user_per_day(date, df)
    56     97.7 MiB      0.0 MiB          10           result.append((date, user))
    57     97.7 MiB      0.0 MiB           1       return result


Function

[(datetime.date(2021, 2, 12), 'RanbirS00614606'),
 (datetime.date(2021, 2, 13), 'MaanDee08215437'),
 (datetime.date(2021, 2, 17), 'RaaJVinderkaur'),
 (datetime.date(2021, 2, 16), 'jot__b'),
 (datetime.date(2021, 2, 14), 'rebelpacifist'),
 (datetime.date(2021, 2, 18), 'neetuanjle_nitu'),
 (datetime.date(2021, 2, 15), 'jot__b'),
 (datetime.date(2021, 2, 20), 'MangalJ23056160'),
 (datetime.date(2021, 2, 23), 'Surrypuria'),
 (datetime.date(2021, 2, 19), 'Preetm91')]

In [18]:
from q2_memory import q2_memory

q2_memory(file_path)


# [('🙏', 8916),
#  ('😂', 4067),
#  ('🚜', 3274),
#  ('✊', 3046),
#  ('🏻', 2696),
#  ('🌾', 2652),
#  ('🇳', 2421),
#  ('🇮', 2417),
#  ('❤', 2263),
#  ('🤣', 2110)]

# Line #    Mem usage    Increment  Occurrences   Line Contents
# =============================================================
#     27     55.7 MiB     55.7 MiB           1   @timer
#     28                                         @profile
#     29                                         def q2_memory(file_path: str) -> List[Tuple[str, int]]:
#     30     55.7 MiB      0.0 MiB           1       pattern = EMOJI_PATTERN
#     31     55.7 MiB      0.0 MiB           1       emoji_counts = Counter()
#     32     55.8 MiB    -35.6 MiB           2       with open(file_path, 'r') as file:
#     33     57.1 MiB -3278819.6 MiB      117408           for line in file:
#     34     57.1 MiB -3278787.3 MiB      117407               try:
#     35     57.1 MiB -3278796.0 MiB      117407                   tweet = json.loads(line)
#     36     57.1 MiB -3278799.6 MiB      117407                   text = tweet.get('renderedContent', '').replace('\n', '')
#     37     57.1 MiB -3278805.4 MiB      117407                   emoji_counts.update(re.findall(pattern, text))
#     38     57.1 MiB -3278808.6 MiB      117407                   tweet_content = tweet.get('quotedTweet', False)
#     39     57.1 MiB -3278811.2 MiB      117407                   if tweet_content:
#     40     57.1 MiB -2275699.9 MiB       82872                       emoji_counts.update(
#     41     57.1 MiB -2275698.3 MiB       82872                           re.findall(
#     42     57.1 MiB -1137848.1 MiB       41436                               pattern,
#     43     57.1 MiB -1137849.2 MiB       41436                               tweet_content.get('renderedContent', '')
#     44                                                                 )
#     45                                                             )
#     46                                                     except json.JSONDecodeError as e:
#     47                                                         print(f"Error decoding JSON in line: {line}. Error: {e}")
#     48     21.7 MiB    -34.0 MiB           1       return emoji_counts.most_common(10)


# Function 'q2_memory' executed in 29.172508 seconds

Filename: /Users/Carlos/Documents/Scripts/Desafios/latam_de/src/q2_memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    27    105.7 MiB    105.7 MiB           1   @timer
    28                                         @profile
    29                                         def q2_memory(file_path: str) -> List[Tuple[str, int]]:
    30    105.7 MiB      0.0 MiB           1       pattern = EMOJI_PATTERN
    31    105.7 MiB      0.0 MiB           1       emoji_counts = Counter()
    32    105.8 MiB    -80.0 MiB           2       with open(file_path, 'r') as file:
    33    105.9 MiB -6418673.5 MiB      117408           for line in file:
    34    105.9 MiB -6418600.0 MiB      117407               try:
    35    105.9 MiB -6418619.9 MiB      117407                   tweet = json.loads(line)
    36    105.9 MiB -6418626.2 MiB      117407                   text = tweet.get('renderedContent', '').replace('\n', '')
    37    105.9 MiB -6418638.0 MiB      117407          

[('🙏', 8916),
 ('😂', 4067),
 ('🚜', 3274),
 ('✊', 3046),
 ('🏻', 2696),
 ('🌾', 2652),
 ('🇳', 2421),
 ('🇮', 2417),
 ('❤', 2263),
 ('🤣', 2110)]

In [19]:
from q2_time import q2_time

q2_time(file_path)

# [('🙏', 8916),
#  ('😂', 4067),
#  ('🚜', 3274),
#  ('✊', 3046),
#  ('🏻', 2696),
#  ('🌾', 2652),
#  ('🇳', 2421),
#  ('🇮', 2417),
#  ('❤', 2263),
#  ('🤣', 2110)]

# Line #    Mem usage    Increment  Occurrences   Line Contents
# =============================================================
#     59     53.2 MiB     53.2 MiB           1   @timer
#     60                                         @profile
#     61                                         def q2_time(file_path: str) -> List[Tuple[str, int]]:
#     62    188.6 MiB    135.4 MiB           1       tweets_content = json_to_text(file_path)
#     63    189.2 MiB      0.6 MiB           1       qty_emojis = count_emojis(tweets_content)
#     64    189.2 MiB      0.0 MiB           1       return qty_emojis


# Function 'q2_time' executed in 6.400599 seconds

Filename: /Users/Carlos/Documents/Scripts/Desafios/latam_de/src/q2_time.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    59     40.4 MiB     40.4 MiB           1   @timer
    60                                         @profile
    61                                         def q2_time(file_path: str) -> List[Tuple[str, int]]:
    62    170.4 MiB    130.0 MiB           1       tweets_content = json_to_text(file_path)
    63    170.7 MiB      0.3 MiB           1       qty_emojis = count_emojis(tweets_content)
    64    170.7 MiB      0.0 MiB           1       return qty_emojis


Function 'q2_time' executed in 6.480842 seconds


[('🙏', 8916),
 ('😂', 4067),
 ('🚜', 3274),
 ('✊', 3046),
 ('🏻', 2696),
 ('🌾', 2652),
 ('🇳', 2421),
 ('🇮', 2417),
 ('❤', 2263),
 ('🤣', 2110)]

In [None]:
from q3_memory import q3_memory
q3_memory(file_path)

# [('narendramodi', 2607),
#  ('Kisanektamorcha', 2038),
#  ('RakeshTikaitBKU', 1842),
#  ('PMOIndia', 1554),
#  ('GretaThunberg', 1272),
#  ('RahulGandhi', 1230),
#  ('RaviSinghKA', 1120),
#  ('DelhiPolice', 1119),
#  ('rihanna', 1112),
#  ('UNHumanRights', 1057)]

# Line #    Mem usage    Increment  Occurrences   Line Contents
# =============================================================
#      9     43.5 MiB     43.5 MiB           1   @timer
#     10                                         @profile
#     11                                         def q3_memory(file_path: str) -> List[Tuple[str, int]]:
#     12     43.5 MiB      0.0 MiB           1       user_counts = Counter()
#     13     43.5 MiB      0.0 MiB           1       pattern = r'\B@([a-zA-Z0-9_]+)\b'
#     14     43.5 MiB    -28.9 MiB           2       with open(file_path, 'r') as file:
#     15     45.0 MiB -880597.8 MiB      117408           for line in file:
#     16     45.0 MiB -880571.6 MiB      117407               try:
#     17     45.0 MiB -880578.9 MiB      117407                   tweet = json.loads(line)
#     18     45.0 MiB -880583.6 MiB      117407                   users = re.findall(pattern, tweet.get('renderedContent', ''))
#     19     45.0 MiB -880586.7 MiB      117407                   user_counts.update(users)
#     20     45.0 MiB -880589.3 MiB      117407                   quoted_tweet = tweet.get('quotedTweet')
#     21     45.0 MiB -880592.0 MiB      117407                   if quoted_tweet:
#     22     45.0 MiB -302738.7 MiB       41436                       quoted_users = re.findall(pattern, quoted_tweet.get('renderedContent', ''))
#     23     45.0 MiB -302739.5 MiB       41436                       user_counts.update(quoted_users)
#     24                                                     except json.JSONDecodeError as e:
#     25                                                         print(f"Error decoding JSON in line: {line}. Error: {e}")
#     26     16.5 MiB    -27.0 MiB           1       return user_counts.most_common(10)


# Function 'q3_memory' executed in 24.060625 seconds
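
The profiler listing above shows the pattern `r'\B@([a-zA-Z0-9_]+)\b'` being used to pull mentions out of `renderedContent`. As a small illustration of what that pattern matches, the sketch below applies it to made-up strings (the texts are invented for this example and are not taken from the dataset).

```python
import re
from collections import Counter

# Same pattern as in the listing: '@' must not follow a word character,
# which excludes the '@' inside e-mail addresses.
MENTION_PATTERN = re.compile(r'\B@([a-zA-Z0-9_]+)\b')

texts = [
    "Support the farmers @narendramodi @PMOIndia",
    "mail me at someone@example.com",
    "@narendramodi please listen",
]

counts = Counter()
for text in texts:
    counts.update(MENTION_PATTERN.findall(text))

top_mentions = counts.most_common(2)
print(top_mentions)
```

The leading `\B` is what rejects `someone@example.com`: between `e` and `@` there is a word boundary, so the non-boundary assertion fails there.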


In [32]:
from q3_time import q3_time

q3_time(file_path)

# [('narendramodi', 2156),
#  ('RakeshTikaitBKU', 1635),
#  ('Kisanektamorcha', 1597),
#  ('PMOIndia', 1355),
#  ('GretaThunberg', 1161),
#  ('RahulGandhi', 1055),
#  ('UNHumanRights', 1021),
#  ('DelhiPolice', 1001),
#  ('rihanna', 949),
#  ('hrw', 907)]

# Line #    Mem usage    Increment  Occurrences   Line Contents
# =============================================================
#     44     53.2 MiB     53.2 MiB           1   @timer
#     45                                         @profile
#     46                                         def q3_time(file_path: str) -> List[Tuple[str, int]]:
#     47    160.0 MiB    106.8 MiB           1       tweets_content = json_to_text(file_path)
#     48    166.4 MiB      6.4 MiB           1       users_qty = count_users(tweets_content)
#     49    166.4 MiB      0.0 MiB           1       return users_qty


# Function 'q3_time' executed in 6.478678 seconds

Filename: /Users/Carlos/Documents/Scripts/Desafios/latam_de/src/q3_time.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    44     53.2 MiB     53.2 MiB           1   @timer
    45                                         @profile
    46                                         def q3_time(file_path: str) -> List[Tuple[str, int]]:
    47    160.0 MiB    106.8 MiB           1       tweets_content = json_to_text(file_path)
    48    166.4 MiB      6.4 MiB           1       users_qty = count_users(tweets_content)
    49    166.4 MiB      0.0 MiB           1       return users_qty


Function 'q3_time' executed in 6.478678 seconds


[('narendramodi', 2156),
 ('RakeshTikaitBKU', 1635),
 ('Kisanektamorcha', 1597),
 ('PMOIndia', 1355),
 ('GretaThunberg', 1161),
 ('RahulGandhi', 1055),
 ('UNHumanRights', 1021),
 ('DelhiPolice', 1001),
 ('rihanna', 949),
 ('hrw', 907)]