<a href="https://colab.research.google.com/github/VictorNGomes/Guided-Project-Building-Fast-Queries-on-a-CSV/blob/main/Tarefa_05_Unidade_1_ED_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guided Project: Building Fast Queries on a CSV
- Adapted from the guided project in the Algorithm Complexity course on the [Dataquest](https://dataquest.io) platform
- We use The Reddit Climate Change Dataset, obtained from [Kaggle](https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset)
- Developed by:
  - Gabriel Lins ([GitHub](https://github.com/gabrielblins))
  - Victor Gomes ([GitHub](https://github.com/victorngomes))

## Getting the data from Kaggle

In [1]:
!mkdir /root/.kaggle

In [2]:
!mv kaggle.json /root/.kaggle

In [3]:
!kaggle datasets download -d pavellexyr/the-reddit-climate-change-dataset

Downloading the-reddit-climate-change-dataset.zip to /content
 99% 1.49G/1.50G [00:13<00:00, 133MB/s]
100% 1.50G/1.50G [00:13<00:00, 116MB/s]


In [4]:
!unzip the-reddit-climate-change-dataset.zip

Archive:  the-reddit-climate-change-dataset.zip
  inflating: the-reddit-climate-change-dataset-comments.csv  
  inflating: the-reddit-climate-change-dataset-posts.csv  


In [5]:
!rm the-reddit-climate-change-dataset.zip

## Defining the functions and classes

In [7]:
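import csv

def read_csv(file_path):
    # Read the whole file into memory as a list of rows (lists of strings).
    with open(file_path, encoding='utf-8') as f:
        reader = csv.reader(f)
        rows = list(reader)

    # Convert the sentiment column (second to last) to float.
    # Empty values become 0.0; the header row is skipped.
    for row in rows[1:]:
        if len(row[-2]) == 0:
            row[-2] = 0.0
        else:
            row[-2] = float(row[-2])

    return rows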


In [8]:
class Searcher():
    def __init__(self, csv):
        self.header = csv[0]
        self.rows = csv[1:]
        # Index built once: maps each comment id (column 1) to its row.
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[1]] = row

    def get_comment_from_id(self, id):
        # Linear scan: O(n) in the number of rows.
        id_col = self.header.index('id')
        for row in self.rows:
            if row[id_col] == id:
                return row
        return None

    def get_comment_from_id_fast(self, id):
        # Dictionary lookup: O(1) on average; returns None if the id is absent.
        return self.id_to_row.get(id)

    def get_sentiment_in_range(self, bottom, upper):
        # Sentiment is the second-to-last column; both bounds are inclusive.
        return [row for row in self.rows if bottom <= row[-2] <= upper]

    def twoScoreSum(self, targetSum):
        # Brute force over all ordered pairs of rows: O(n^2).
        # Note that a row can be paired with itself.
        for row1 in self.rows:
            for row2 in self.rows:
                if float(row1[-1]) + float(row2[-1]) == targetSum:
                    return [row1, row2]
        return -1

    def twoScoreSum_fast(self, targetSum):
        # Single pass: O(n). `results` maps each score seen so far
        # to the most recent row with that score.
        results = {}
        for row in self.rows:
            y = targetSum - float(row[-1])
            if y in results:
                return [results[y], row]
            else:
                results[float(row[-1])] = row
        return -1
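
`get_sentiment_in_range` scans every row on each call. One possible alternative, sketched below under the assumption that the rows can be sorted once by sentiment, is to locate both ends of the range with binary search (`bisect`). The class name `SortedSentimentIndex` and the toy rows are hypothetical and not part of this notebook.

```python
from bisect import bisect_left, bisect_right


class SortedSentimentIndex:
    """Hypothetical helper: answers sentiment range queries with binary search."""

    def __init__(self, rows, sentiment_col=-2):
        # Sort once (O(n log n)); each later query costs O(log n + k),
        # where k is the number of rows returned.
        self.rows = sorted(rows, key=lambda row: row[sentiment_col])
        self.keys = [row[sentiment_col] for row in self.rows]

    def in_range(self, bottom, upper):
        lo = bisect_left(self.keys, bottom)   # first index with key >= bottom
        hi = bisect_right(self.keys, upper)   # first index with key > upper
        return self.rows[lo:hi]


# Toy rows shaped like the tail of a dataset row: [id, sentiment, score]
toy_rows = [
    ["a", 0.50, "2"],
    ["b", -0.30, "1"],
    ["c", 0.00, "4"],
    ["d", 0.75, "-1"],
]
index = SortedSentimentIndex(toy_rows)
print([row[0] for row in index.in_range(0.0, 0.5)])  # ['c', 'a']
```

Rows come back ordered by sentiment rather than in file order, so this is not a drop-in replacement if callers rely on the original order.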

## *Testing the implemented methods*

In [9]:
data = read_csv('the-reddit-climate-change-dataset-comments.csv')


In [10]:
srch = Searcher(data)

In [10]:
srch.rows[56]

['comment',
 'iml9tsw',
 '2g3blu',
 'coronavirusdownunder',
 'false',
 '1661988796',
 'https://old.reddit.com/r/CoronavirusDownunder/comments/x27ii7/antivaxxers_lose_in_federal_court_must_pay_200k/iml9tsw/',
 'Just because someone has donated to a university doesn\'t mean they have any say over the results of the research. Any scientist that intentionally publishes misleading information will destroy their entire career. The fact is that regardless of the results of any scientific research, there will be plenty of science left to do. \n\nIf we suddenly discovered that covid vaccines were unsafe (which they aren\'t), it\'s not like everyone would lose their jobs - in fact they\'d have more work to do, and would receive even more funding to develop a vaccine that is safe. If they found that covid vaccines are perfectly safe (which they are), then there would still be more work to do developing future vaccines to be more effective, especially against newer variants.\n\nEssentially, the li

In [11]:
srch.header

['type',
 'id',
 'subreddit.id',
 'subreddit.name',
 'subreddit.nsfw',
 'created_utc',
 'permalink',
 'body',
 'sentiment',
 'score']

In [12]:
srch.get_comment_from_id('imld6cb')

['comment',
 'imld6cb',
 '2qi09',
 'sacramento',
 'false',
 '1661990278',
 'https://old.reddit.com/r/Sacramento/comments/x2ruqy/hey_guyz_this_is_a_tough_one_why_do_you_think/imld6cb/',
 "Not just Sacramento. It's actually happening all over the world. Climate change is real, believe it or not.",
 0.0,
 '4']

In [13]:
srch.get_comment_from_id('iml9tsw')

['comment',
 'iml9tsw',
 '2g3blu',
 'coronavirusdownunder',
 'false',
 '1661988796',
 'https://old.reddit.com/r/CoronavirusDownunder/comments/x27ii7/antivaxxers_lose_in_federal_court_must_pay_200k/iml9tsw/',
 'Just because someone has donated to a university doesn\'t mean they have any say over the results of the research. Any scientist that intentionally publishes misleading information will destroy their entire career. The fact is that regardless of the results of any scientific research, there will be plenty of science left to do. \n\nIf we suddenly discovered that covid vaccines were unsafe (which they aren\'t), it\'s not like everyone would lose their jobs - in fact they\'d have more work to do, and would receive even more funding to develop a vaccine that is safe. If they found that covid vaccines are perfectly safe (which they are), then there would still be more work to do developing future vaccines to be more effective, especially against newer variants.\n\nEssentially, the li

In [14]:
srch.get_comment_from_id_fast('imld6cb')

['comment',
 'imld6cb',
 '2qi09',
 'sacramento',
 'false',
 '1661990278',
 'https://old.reddit.com/r/Sacramento/comments/x2ruqy/hey_guyz_this_is_a_tough_one_why_do_you_think/imld6cb/',
 "Not just Sacramento. It's actually happening all over the world. Climate change is real, believe it or not.",
 0.0,
 '4']

## Comparing performance


- Execution time of *get_comment_from_id* and *get_comment_from_id_fast*

In [20]:
from random import shuffle

# Take the ids of the first 15 rows and time the lookups in random order.
ids = [row[1] for row in srch.rows[:15]]
shuffle(ids)
print('################################################################')
for id in ids:
    print(id)
    %timeit -n 100 srch.get_comment_from_id(id)
    print('################################################################')



################################################################
imldbeh
600 ns ± 50.6 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imlddn9
384 ns ± 19.7 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imld0kj
1.15 µs ± 39.2 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imlctc0
1.56 µs ± 74 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imld6cb
1.01 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imldado
926 ns ± 381 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imlch0h
2.34 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100 loops

In [16]:
print('################################################################')
for id in ids:
    print(id)
    %timeit -n 100 srch.get_comment_from_id_fast(id)
    print('################################################################')

################################################################
imldbeh
482 ns ± 75.7 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imlcpab
245 ns ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imlc7mr
249 ns ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imlcfv2
272 ns ± 47 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imlcm07
299 ns ± 149 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imlctri
244 ns ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
imld6cb
242 ns ± 8.39 ns per loop (mean ± std. dev. of 7 runs, 100 loops e
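
The two lookups differ in how much work they do: the linear scan compares ids row by row, while the dictionary hashes the id and goes straight to the row. Because the 15 sampled ids all come from the first rows of the file, the scan stops early and the measured gap stays small. The sketch below uses synthetic data (the ids and the `scan`/`lookup` helpers are made up for illustration) and also probes an id at the end of the list, where the scan has to walk through every row.

```python
import timeit

# Synthetic stand-in for the dataset: 100,000 rows of [type, id].
rows = [["comment", f"id{i}"] for i in range(100_000)]
id_to_row = {row[1]: row for row in rows}


def scan(target):
    """Linear scan: O(n) comparisons in the worst case."""
    for row in rows:
        if row[1] == target:
            return row
    return None


def lookup(target):
    """Dictionary lookup: O(1) on average, wherever the row sits."""
    return id_to_row.get(target)


# Time one id at the start of the list and one at the end.
for target in ("id0", "id99999"):
    t_scan = timeit.timeit(lambda: scan(target), number=20)
    t_dict = timeit.timeit(lambda: lookup(target), number=20)
    print(f"{target}: scan {t_scan:.6f}s, dict {t_dict:.6f}s")
```

Exact timings depend on the machine, so none are shown here. The scan time should grow with the position of the id, while the dictionary time should stay roughly flat.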

- Execution time of the functions twoScoreSum and twoScoreSum_fast

In [17]:
parametros = [4,27,70,100]
print('################################################################')
for param in parametros:
    print(param)
    %timeit -n 100 srch.twoScoreSum(param)
    print('################################################################') 

################################################################
4
727 ns ± 40.4 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
27
31.5 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
70
129 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
100
403 µs ± 20.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################


In [19]:
print('################################################################')
for param in parametros:
    print(param)
    %timeit -n 100 srch.twoScoreSum_fast(param)
    print('################################################################') 

################################################################
4
1.93 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
27
24.4 µs ± 5.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
70
44 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
100
172 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
################################################################
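
Both methods return the first matching pair they find, so the timings above depend on how early a match appears in the file. The difference in growth rate is clearest when no pair exists: nested loops check every pair, O(n²), while a version that remembers the values already seen makes a single pass, O(n). The sketch below works on a plain list of integer scores, with illustrative function names and data. Unlike `twoScoreSum`, the brute-force version here never pairs an element with itself, and both functions return `None` rather than `-1` when nothing matches.

```python
def two_sum_brute(scores, target):
    """Check every pair of distinct positions: O(n^2)."""
    for i, a in enumerate(scores):
        for b in scores[i + 1:]:
            if a + b == target:
                return (a, b)
    return None


def two_sum_hashed(scores, target):
    """Single pass, remembering values already seen: O(n) on average."""
    seen = set()
    for value in scores:
        if target - value in seen:
            return (target - value, value)
        seen.add(value)
    return None


scores = [2, -1, 4, 1, 7]
print(two_sum_brute(scores, 5))       # (4, 1)
print(two_sum_hashed(scores, 5))      # (4, 1)
print(two_sum_hashed(scores, 50000))  # None
```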


## Implementing unit tests with Pytest

In [21]:
!pip install pytest

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
%%file test_data.py
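import pytest
import csv

def read_csv(file_path):
    # Read the whole file into memory as a list of rows (lists of strings).
    with open(file_path, encoding='utf-8') as f:
        reader = csv.reader(f)
        rows = list(reader)

    # Convert the sentiment column (second to last) to float.
    # Empty values become 0.0; the header row is skipped.
    for row in rows[1:]:
        if len(row[-2]) == 0:
            row[-2] = 0.0
        else:
            row[-2] = float(row[-2])

    return rows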

class Searcher():
    def __init__(self, csv):
        self.header = csv[0]
        self.rows = csv[1:]
        # Index built once: maps each comment id (column 1) to its row.
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[1]] = row

    def get_comment_from_id(self, id):
        # Linear scan: O(n) in the number of rows.
        id_col = self.header.index('id')
        for row in self.rows:
            if row[id_col] == id:
                return row
        return None

    def get_comment_from_id_fast(self, id):
        # Dictionary lookup: O(1) on average; returns None if the id is absent.
        return self.id_to_row.get(id)

    def get_sentiment_in_range(self, bottom, upper):
        # Sentiment is the second-to-last column; both bounds are inclusive.
        return [row for row in self.rows if bottom <= row[-2] <= upper]

    def twoScoreSum(self, targetSum):
        # Brute force over all ordered pairs of rows: O(n^2).
        # Note that a row can be paired with itself.
        for row1 in self.rows:
            for row2 in self.rows:
                if float(row1[-1]) + float(row2[-1]) == targetSum:
                    return [row1, row2]
        return -1

    def twoScoreSum_fast(self, targetSum):
        # Single pass: O(n). `results` maps each score seen so far
        # to the most recent row with that score.
        results = {}
        for row in self.rows:
            y = targetSum - float(row[-1])
            if y in results:
                return [results[y], row]
            else:
                results[float(row[-1])] = row
        return -1

arquivo = 'the-reddit-climate-change-dataset-comments.csv'

twoscoresum_result = [
    ['comment',
    'imlddn9',
    '2qh3l',
    'news',
    'false',
    '1661990368',
    'https://old.reddit.com/r/news/comments/x2cszk/us_life_expectancy_down_for_secondstraight_year/imlddn9/',
    'Yeah but what the above commenter is saying is their base doesn’t want any of that. They detest all of those things, even the small gradual changes. Investing in nuclear energy is a tacit acknowledgement of man made climate change. Any acknowledgement or concession and they will be primaried out in a minute',
    0.5719,
    '2'],
    ['comment',
    'iml68pg',
    '2qh1n',
    'environment',
    'false',
    '1661987214',
    'https://old.reddit.com/r/environment/comments/x2d6mk/climate_scientists_urge_more_civil_disobedience/iml68pg/',
    "I'm all for protests as long as they don't involve damage of people or property. Despite the fact that I value this world and its living things more than any and all property if we go around smashing stuff it's just going to corrupt the important message and turn people against it and us.\n\nEveryone needs to come together on climate change and we need to educate climate change deniers rather than shout them down or insult them and especially never try to hurt them.\n\nA united human race is what is required to help the Earth and we won't get that if we attack each other.",
    0.5725,
    '-1']
]

twoscoresumfast_result = [
    ['comment',
    'imldado',
    '2qhma',
    'newzealand',
    'false',
    '1661990327',
    'https://old.reddit.com/r/newzealand/comments/x28xci/long_rant_pessimistic_asf_and_feel_like_were/imldado/',
    "I'm honestly waiting for climate change and the impacts of that to kick some fucking sense into people. But who am I kidding itll still just be more of the poor suffering while the rich claim victim hood for handouts while letting us all starve. Its honestly hard some days to not just give up, and I truly wonder if and when anything will ever actually be done.",
    -0.1143,
    '1'],
    ['comment',
    'imld6cb',
    '2qi09',
    'sacramento',
    'false',
    '1661990278',
    'https://old.reddit.com/r/Sacramento/comments/x2ruqy/hey_guyz_this_is_a_tough_one_why_do_you_think/imld6cb/',
    "Not just Sacramento. It's actually happening all over the world. Climate change is real, believe it or not.",
    0.0,
    '4']
]

@pytest.fixture(scope='session')
def dataset():
    data = read_csv(arquivo)
    search = Searcher(data)
    return search

def test_get_comment_from_id(dataset):
    '''
    Tests get_comment_from_id with a valid id
    '''
    assert dataset.get_comment_from_id('imld6cb') == ['comment',
                                                      'imld6cb',
                                                      '2qi09',
                                                      'sacramento',
                                                      'false',
                                                      '1661990278',
                                                      'https://old.reddit.com/r/Sacramento/comments/x2ruqy/hey_guyz_this_is_a_tough_one_why_do_you_think/imld6cb/',
                                                      "Not just Sacramento. It's actually happening all over the world. Climate change is real, believe it or not.",
                                                      0.0,
                                                      '4']

def test_get_comment_from_id_notfound(dataset):
    '''
    Tests get_comment_from_id with an invalid id
    '''
    assert dataset.get_comment_from_id('aabb1122') is None

def test_get_comment_from_id_fast(dataset):
    '''
    Tests get_comment_from_id_fast with a valid id
    '''
    assert dataset.get_comment_from_id_fast('imld6cb') == ['comment',
                                                           'imld6cb',
                                                           '2qi09',
                                                           'sacramento',
                                                           'false',
                                                           '1661990278',
                                                           'https://old.reddit.com/r/Sacramento/comments/x2ruqy/hey_guyz_this_is_a_tough_one_why_do_you_think/imld6cb/',
                                                           "Not just Sacramento. It's actually happening all over the world. Climate change is real, believe it or not.",
                                                           0.0,
                                                           '4']

def test_get_comment_from_id_fast_notfound(dataset):
    '''
    Tests get_comment_from_id_fast with an invalid id
    '''
    assert dataset.get_comment_from_id_fast('aabb1122') is None

def test_get_sentiment_in_range(dataset):
    '''
    Tests get_sentiment_in_range with an arbitrary range
    '''
    assert len(dataset.get_sentiment_in_range(-0.0101, -0.01)) == 54

def test_twoScoreSum(dataset):

    assert dataset.twoScoreSum(1) == twoscoresum_result

# def test_twoScoreSum_notfound(dataset):

#     assert dataset.twoScoreSum(50000) == -1

def test_twoScoreSum_fast(dataset):

    assert dataset.twoScoreSum_fast(5) == twoscoresumfast_result

def test_twoScoreSum_fast_notfound(dataset):

    assert dataset.twoScoreSum_fast(50000) == -1

Overwriting test_data.py


In [2]:
!pytest test_data.py -vv

platform linux -- Python 3.7.14, pytest-3.6.4, py-1.11.0, pluggy-0.7.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1
collected 8 items

test_data.py::test_get_comment_from_id PASSED                            [ 12%]
test_data.py::test_get_comment_from_id_notfound PASSED                   [ 25%]
test_data.py::test_get_comment_from_id_fast PASSED                       [ 37%]
test_data.py::test_get_comment_from_id_fast_notfound PASSED              [ 50%]
test_data.py::test_get_sentiment_in_range PASSED                         [ 62%]
test_data.py::test_twoScoreSum PASSED