<a href="https://colab.research.google.com/github/gbonanno/gab-dataart-challenge/blob/developer-gbonanno/src/challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Engineer Challenge

## Imports and initial setup

We start by installing a few libraries and importing everything needed for the rest of the notebook.  
The tweet file is also downloaded here, and some variables used later on are declared.

In [1]:
pip install emoji

Collecting emoji
  Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m431.4/431.4 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: emoji
Successfully installed emoji-2.12.1


In [2]:
pip install memory_profiler

Collecting memory_profiler
  Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)
Installing collected packages: memory_profiler
Successfully installed memory_profiler-0.61.0


In [3]:
# Imports
import pandas as pd
import json
import emoji
import gdown
import zipfile
from collections import defaultdict, Counter
import importlib.util
import sys
from memory_profiler import memory_usage
import cProfile

In [4]:
# Download the tweets file from its Google Drive link
file_id = '1ig2ngoXFTxP5Pa8muXo02mDTFexZzsis'
file_path_zip = 'farmers-protest-tweets-2021-2-4.json.zip'
file_name = 'farmers-protest-tweets-2021-2-4.json'
local_path = '.'
full_path = local_path + '/' + file_name

gdown.download(f'https://drive.google.com/uc?id={file_id}', file_path_zip, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1ig2ngoXFTxP5Pa8muXo02mDTFexZzsis
From (redirected): https://drive.google.com/uc?id=1ig2ngoXFTxP5Pa8muXo02mDTFexZzsis&confirm=t&uuid=9bf2c2be-6201-4585-91f4-0ce21779868c
To: /content/farmers-protest-tweets-2021-2-4.json.zip
100%|██████████| 60.4M/60.4M [00:01<00:00, 53.7MB/s]


'farmers-protest-tweets-2021-2-4.json.zip'

In [5]:
# The file is a ZIP archive, so it has to be extracted first
with zipfile.ZipFile(file_path_zip, 'r') as zip_ref:
    zip_ref.extractall(local_path)

## File analysis

Nothing in this section is required for the deliverable. It is a quick look at the contents of the file, to make it easier to work with the requested functions later on.

In [6]:
# Open the file (read-only access is enough; the tweets are UTF-8 text)
fp = open(full_path, 'r', encoding='utf-8')

In [7]:
# Load the tweets
# The `tweets` list holds the full JSON of every tweet
# The DataFrame holds only the fields needed by the later queries
tweets = []
dates = []
users = []
contents = []
mentioned_users = []

for line in fp:
    line_json = json.loads(line)
    tweets.append(line_json)

    if line_json["mentionedUsers"] is not None:
        mentioned = ' | '.join(user["username"] for user in line_json["mentionedUsers"])
    else:
        mentioned = ''

    dates.append(line_json["date"])
    users.append(line_json["user"]["username"])
    contents.append(line_json["content"])
    mentioned_users.append(mentioned)

tweets_campos = pd.DataFrame({
    'date': dates,
    'user': users,
    'content': contents,
    'mentioned_users': mentioned_users
})
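The loop above stores every parsed tweet in memory and leaves the file handle open. As a lighter alternative, the same four fields can be extracted with a generator that reads one line at a time and closes the file when it finishes. This is only a sketch: `iter_tweet_fields` is a hypothetical helper that is not part of the repository, and it assumes the same keys used above (`date`, `user.username`, `content`, `mentionedUsers`).

```python
import json
from typing import Dict, Iterator


def iter_tweet_fields(path: str) -> Iterator[Dict[str, str]]:
    """Yield one small dict per tweet from a JSON Lines file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            # mentionedUsers can be null in the file; treat that as "no mentions"
            mentioned = tweet.get("mentionedUsers") or []
            yield {
                "date": tweet["date"],
                "user": tweet["user"]["username"],
                "content": tweet["content"],
                "mentioned_users": " | ".join(u["username"] for u in mentioned),
            }
```

Passing `list(iter_tweet_fields(full_path))` to `pd.DataFrame` should give a frame with the same four columns as `tweets_campos`, without keeping the full JSON of each tweet around.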

In [8]:
# Display the extracted fields
tweets_campos

index,date,user,content,mentioned_users
0,2021-02-24T09:23:35+00:00,ArjunSinghPanam,The world progresses while the Indian police a...,narendramodi | DelhiPolice
1,2021-02-24T09:23:32+00:00,PrdeepNain,#FarmersProtest \n#ModiIgnoringFarmersDeaths \...,Kisanektamorcha
2,2021-02-24T09:23:22+00:00,parmarmaninder,ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾਂ ਨੂੰ ਮੱਦੇਨਜ਼ਰ ਰੱਖਦੇ ਹੋਏ \nਮੇ...,
3,2021-02-24T09:23:16+00:00,anmoldhaliwal,@ReallySwara @rohini_sgh watch full video here...,ReallySwara | rohini_sgh
4,2021-02-24T09:23:10+00:00,KotiaPreet,#KisanEktaMorcha #FarmersProtest #NoFarmersNoF...,
...,...,...,...,...
117402,2021-02-12T01:37:02+00:00,rickyrickstir,#FarmersProtest #KisanAndolan #KisaanMajdoorEk...,
117403,2021-02-12T01:36:53+00:00,PunjabTak,PM मोदी की अपील के बीच संयुक्त किसान मोर्चा का...,
117404,2021-02-12T01:36:50+00:00,ish_kayy,United we stand.\nDivided we fall\n#Mahapancha...,
117405,2021-02-12T01:36:49+00:00,TV9Bharatvarsh,"सिंघु बॉर्डर पर लंबी लड़ाई की तैयारी, किसानों ...",


In [9]:
# Print the last tweet read (line_json from the loop above) to inspect its format
print(json.dumps(line_json, indent=4))

{
    "url": "https://twitter.com/SikhVibes/status/1360040127146430470",
    "date": "2021-02-12T01:36:49+00:00",
    "content": "@Kisanektamorcha We are with you, keep the morcha alive and strong \ud83d\udcaa #FarmersProtest #MahapanchayatRevolution",
    "renderedContent": "@Kisanektamorcha We are with you, keep the morcha alive and strong \ud83d\udcaa #FarmersProtest #MahapanchayatRevolution",
    "id": 1360040127146430470,
    "user": {
        "username": "SikhVibes",
        "displayname": "SikhVibes",
        "id": 1568618503,
        "description": "SikhVibes.com is a premium Sikh Multimedia website with thousands of rare Audio Recordings, Videos and Katha from all over the world!",
        "rawDescription": "https://t.co/YWcGuzCBWn is a premium Sikh Multimedia website with thousands of rare Audio Recordings, Videos and Katha from all over the world!",
        "descriptionUrls": [
            {
                "text": "SikhVibes.com",
                "url": "http://SikhVibes.co

## Downloading the functions from GitHub

The versions of the functions stored in the GitHub repository are downloaded so they can be called later on.

In [10]:
# Clone the GitHub repository that contains the functions
!git clone https://github.com/gbonanno/gab-dataart-challenge.git

Cloning into 'gab-dataart-challenge'...
remote: Enumerating objects: 57, done.
remote: Counting objects: 100% (57/57), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 57 (delta 34), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (57/57), 584.83 KiB | 3.23 MiB/s, done.
Resolving deltas: 100% (34/34), done.


In [11]:
# Load the modules and execute them so their functions can be called
# ENHANCEMENT: this cell repeats the same steps for every function; a few helper functions would remove the duplication.
module_path = './gab-dataart-challenge/src/'
module_name1 = 'q1_memory'
module_name2 = 'q1_time'
module_name3 = 'q2_memory'
module_name4 = 'q2_time'
module_name5 = 'q3_memory'
module_name6 = 'q3_time'

# Load the modules
spec_q1_memory = importlib.util.spec_from_file_location(module_name1, module_path + module_name1 + '.py')
module_q1_memory = importlib.util.module_from_spec(spec_q1_memory)
sys.modules[module_name1] = module_q1_memory
spec_q1_memory.loader.exec_module(module_q1_memory)

spec_q1_time = importlib.util.spec_from_file_location(module_name2, module_path + module_name2 + '.py')
module_q1_time = importlib.util.module_from_spec(spec_q1_time)
sys.modules[module_name2] = module_q1_time
spec_q1_time.loader.exec_module(module_q1_time)

spec_q2_memory = importlib.util.spec_from_file_location(module_name3, module_path + module_name3 + '.py')
module_q2_memory = importlib.util.module_from_spec(spec_q2_memory)
sys.modules[module_name3] = module_q2_memory
spec_q2_memory.loader.exec_module(module_q2_memory)

spec_q2_time = importlib.util.spec_from_file_location(module_name4, module_path + module_name4 + '.py')
module_q2_time = importlib.util.module_from_spec(spec_q2_time)
sys.modules[module_name4] = module_q2_time
spec_q2_time.loader.exec_module(module_q2_time)

spec_q3_memory = importlib.util.spec_from_file_location(module_name5, module_path + module_name5 + '.py')
module_q3_memory = importlib.util.module_from_spec(spec_q3_memory)
sys.modules[module_name5] = module_q3_memory
spec_q3_memory.loader.exec_module(module_q3_memory)

spec_q3_time = importlib.util.spec_from_file_location(module_name6, module_path + module_name6 + '.py')
module_q3_time = importlib.util.module_from_spec(spec_q3_time)
sys.modules[module_name6] = module_q3_time
spec_q3_time.loader.exec_module(module_q3_time)

# Get the functions from the modules
q1_memory = module_q1_memory.q1_memory
q1_time = module_q1_time.q1_time
q2_memory = module_q2_memory.q2_memory
q2_time = module_q2_time.q2_time
q3_memory = module_q3_memory.q3_memory
q3_time = module_q3_time.q3_time
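The enhancement noted in the cell above could look roughly like the following. It wraps the same four `importlib` steps in one function and assumes, as the cell does, that each file `<name>.py` defines a function with the same name. `load_function` is a hypothetical helper name, not something the repository provides.

```python
import importlib.util
import sys
from pathlib import Path
from typing import Callable


def load_function(module_dir: str, name: str) -> Callable:
    """Import `<module_dir>/<name>.py` and return its function called `name`."""
    file_path = Path(module_dir) / f"{name}.py"
    spec = importlib.util.spec_from_file_location(name, str(file_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {file_path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # register it, as the original cell does
    spec.loader.exec_module(module)
    return getattr(module, name)
```

With this in place, the six functions could be loaded in one expression, for example `functions = {name: load_function(module_path, name) for name in ('q1_memory', 'q1_time', 'q2_memory', 'q2_time', 'q3_memory', 'q3_time')}`.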

## Memory and time measurements

### Query 1: top 10 dates and users

In [12]:
# Measure the memory usage of q1_memory. The printed label "Uso de memoria" means "memory usage"; samples are in MiB.
mem_usage = memory_usage((q1_memory, (local_path + '/' + file_name,)))
print(f"Uso de memoria: {mem_usage}")

Uso de memoria: [1348.32421875, 1348.51953125, 1349.1015625, 1350.0859375, 1350.8203125, 1351.63671875, 1352.55859375, 1353.18359375, 1353.80859375, 1354.48828125, 1355.046875, 1355.9140625, 1356.6171875, 1357.34375, 1357.8125, 1358.63671875, 1359.390625, 1360.28125, 1361.28125, 1362.00390625, 1362.78515625, 1363.39453125, 1364.34765625, 1365.12109375, 1365.89453125, 1366.5390625, 1367.4140625, 1367.94921875, 1368.68359375, 1369.29296875, 1369.984375, 1370.67578125, 1371.55078125, 1372.6015625, 1373.24609375, 1373.890625, 1374.83203125, 1375.5546875, 1376.2734375, 1376.7421875, 1377.46875, 1377.9453125, 1378.8203125, 1379.52734375, 1380.1875, 1380.76171875, 1381.70703125, 1382.3046875, 1382.9375, 1383.80859375, 1384.3203125, 1384.73046875, 1385.5859375, 1386.35546875, 1387.12109375, 1387.88671875, 1388.48828125, 1388.71875, 1388.93359375, 1389.15625, 1389.6328125, 1390.140625, 1390.94140625, 1391.765625, 1392.01953125, 1394.0390625, 1396.484375, 1410.0546875, 1411.6015625, 1411.6015625
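`memory_usage` samples the memory of the whole Python process at a fixed interval (0.1 s by default), so the raw list includes everything the notebook already holds. That is probably why the first sample is already above 1.3 GiB. A short summary of baseline, peak, and increment is easier to compare across functions than the full list. The sketch below is one way to compute it; `summarize_memory` is a hypothetical helper, and taking the first sample as the baseline is an assumption.

```python
from typing import Dict, Sequence


def summarize_memory(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize memory samples in MiB, e.g. the list returned by memory_usage."""
    if not samples:
        raise ValueError("samples must not be empty")
    baseline = samples[0]  # assumption: the first sample is taken before real work starts
    peak = max(samples)
    return {
        "baseline_mib": baseline,
        "peak_mib": peak,
        "increment_mib": peak - baseline,
    }


# Three of the samples printed above, used only as an illustration.
example = summarize_memory([1348.32421875, 1350.0859375, 1411.6015625])
print(example)
```

In the notebook the call would be `summarize_memory(mem_usage)`. The increment is the figure that says how much extra memory the function itself needed.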

In [13]:
# Measure the memory usage of q1_time (samples in MiB).
mem_usage = memory_usage((q1_time, (local_path + '/' + file_name,)))
print(f"Uso de memoria: {mem_usage}")

Uso de memoria: [1377.5703125, 1377.59375, 1370.2421875, 1370.41796875, 1370.41796875, 1370.41796875, 1370.4296875, 1370.54296875, 1370.703125, 1370.703125, 1370.703125, 1370.703125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.83203125, 1370.91015625, 1371.4609375, 1371.86328125, 1372.55859375, 1373.13671875, 1373.640625, 1374.0859375, 1374.75390625, 1375.35546875, 1375.81640625, 1376.41015625, 1376.78515625, 1377.4375, 1377.9140625, 1378.35546875, 1379.0546875, 1379.55078125, 1380.05859375, 1380.75, 1381.22265625, 1381.7265625, 1382.1328125, 1382.83203125, 1383.46484375, 1384.02734375, 1384.46875, 1384.9453125, 1385.6171875, 1386.14453125, 1386.70703125, 1387.26171875, 1387.73828125, 

In [14]:
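# Measure the execution time of q1_memory with cProfile.
# Note: totaltime includes the time spent in sub-calls, so summing it over every
# profiler entry counts nested calls more than once. The figure overstates
# wall-clock time and is best read as a relative cost for comparing versions.
profiler = cProfile.Profile()

profiler.enable()
q1_memory(local_path + '/' + file_name)
profiler.disable()

stats = profiler.getstats()
total_time = sum(stat.totaltime for stat in stats)
print("Total time:", total_time)

Total time: 70.98875899200004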
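Because each `totaltime` entry returned by `getstats()` includes the time spent in the functions it calls, the sum above is larger than the real duration of the run. If the goal is plain elapsed time, a wall-clock measurement with `time.perf_counter` is a simpler option. The sketch below shows one way to do it; `time_call` is a hypothetical helper, not part of the repository.

```python
import time
from typing import Any, Callable, Tuple


def time_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

In the notebook this would be used as `_, seconds = time_call(q1_memory, full_path)`. Running it without the profiler attached also avoids cProfile's own overhead, which can be noticeable on code that makes many small calls.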


In [15]:
# Measure the execution time of q1_time with cProfile.
profiler = cProfile.Profile()

profiler.enable()
q1_time(local_path + '/' + file_name)
profiler.disable()

stats = profiler.getstats()
total_time = sum(stat.totaltime for stat in stats)
print("Total time:", total_time)

Total time: 61.86381752900009


### Query 2: top 10 emojis

In [16]:
# Measure the memory usage of q2_memory (samples in MiB).
mem_usage = memory_usage((q2_memory, (local_path + '/' + file_name,)))
print(f"Uso de memoria: {mem_usage}")

Uso de memoria: [1396.04296875, 1396.04296875, 1396.04296875, 1396.04296875, 1396.04296875, 1396.04296875, 1396.04296875, 1396.04296875, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.05859375, 1396.058

In [17]:
# Measure the memory usage of q2_time (samples in MiB).
mem_usage = memory_usage((q2_time, (local_path + '/' + file_name,)))
print(f"Uso de memoria: {mem_usage}")

Uso de memoria: [1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078125, 1395.078

In [18]:
# Measure the execution time of q2_memory with cProfile.
profiler = cProfile.Profile()

profiler.enable()
q2_memory(local_path + '/' + file_name)
profiler.disable()

stats = profiler.getstats()
total_time = sum(stat.totaltime for stat in stats)
print("Total time:", total_time)

Total time: 70.045013981


In [19]:
# Measure the execution time of q2_time with cProfile.
profiler = cProfile.Profile()

profiler.enable()
q2_time(local_path + '/' + file_name)
profiler.disable()

stats = profiler.getstats()
total_time = sum(stat.totaltime for stat in stats)
print("Total time:", total_time)

Total time: 55.01850804400001


### Query 3: top 10 mentioned users

In [20]:
# Measure the memory usage of q3_memory (samples in MiB).
mem_usage = memory_usage((q3_memory, (local_path + '/' + file_name,)))
print(f"Uso de memoria: {mem_usage}")

Uso de memoria: [1389.1953125, 1389.1953125, 1389.1953125, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.19921875, 1389.203125, 1389.203125, 

In [21]:
# Measure the memory usage of q3_time (samples in MiB).
mem_usage = memory_usage((q3_time, (local_path + '/' + file_name,)))
print(f"Uso de memoria: {mem_usage}")

Uso de memoria: [1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.203125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125, 1389.20703125]


In [22]:
# Measure the execution time of q3_memory with cProfile.
profiler = cProfile.Profile()

profiler.enable()
q3_memory(local_path + '/' + file_name)
profiler.disable()

stats = profiler.getstats()
total_time = sum(stat.totaltime for stat in stats)
print("Total time:", total_time)

Total time: 58.26754132300001


In [23]:
# Measure the execution time of q3_time with cProfile.
profiler = cProfile.Profile()

profiler.enable()
q3_time(local_path + '/' + file_name)
profiler.disable()

stats = profiler.getstats()
total_time = sum(stat.totaltime for stat in stats)
print("Total time:", total_time)

Total time: 42.476555834
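For context on what this query computes, a top-10 count of mentioned users can be sketched with `collections.Counter` over the `mentionedUsers` field seen in the sample tweet earlier. This is an illustrative sketch only, not the repository's `q3_memory` or `q3_time`. `top_mentioned` is a hypothetical name, and it assumes one JSON object per line with `mentionedUsers` either null or a list of objects that carry a `username`.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def top_mentioned(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Count usernames in each tweet's `mentionedUsers`; return the n most common."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        # mentionedUsers can be null; treat that as "no mentions"
        for user in tweet.get("mentionedUsers") or []:
            counts[user["username"]] += 1
    return counts.most_common(n)
```

Since an open file is an iterable of lines, it can be passed in directly: `with open(full_path, encoding='utf-8') as fp: top_mentioned(fp)`. Only the counter stays in memory, not the tweets themselves.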
