# Hispasonic


<br>

Hispasonic is a website about musical instruments, recording gear, and everything related to the world of music. It also hosts a second-hand market where users sell their musical instruments.

This first part of the project focuses on obtaining the relevant ad information. The category I have focused on is electronic musical instruments.

<br>

Before obtaining any information, we must first understand how the ad listing page is organized.

***

- *Image of one of the pages of hispasonic*


![hispa_1e.png](hispa_1e.png)

<br>

<br>

We can see several important things:

- The selected category is "teclados y sintetizadores" (keyboards and synthesizers).

- The pagination tells us how many pages we have to analyze to get **all the ads**.



## Function library loading

In [1]:
import requests               # Is an elegant and simple HTTP library for Python
from bs4 import BeautifulSoup # library for pulling data out of HTML and XML files
import re                     # regular expressions operations
import pandas as pd           # A fast, powerful, flexible and easy to use open source data analysis tool
import os                     # A versatile way to use operating system-dependent functionality.
import datetime as dt         # module for manipulating dates and times.
import time

pd.set_option("display.max_rows", None)

### First contact

First of all, we need to check that we get a proper response from the server.

In [None]:
%%html 
<style>
table {float:left}
</style>

These are the main classes of response we can get from the server:

|Status class|Meaning|
|:--|:--|
|**1xx informational response** |the request was received, continuing process|
|**2xx successful** |the request was successfully received, understood, and accepted|
|**3xx redirection** |further action needs to be taken in order to complete the request|
|**4xx client error** |the request contains bad syntax or cannot be fulfilled|
|**5xx server error** |the server failed to fulfil an apparently valid request|
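As a small illustration of the table above, the class of a status code can be read from its first digit. This is only a sketch: `status_class` and `STATUS_CLASSES` are hypothetical names, not part of `requests`. With a real response, the value to pass would be `page.status_code`.

```python
# Hypothetical helper (not part of requests): maps an HTTP status code
# to the class described in the table above.
STATUS_CLASSES = {
    1: "informational response",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the class of an HTTP status code based on its first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


print(status_class(200))
print(status_class(404))
```

In the scraper itself, calling `page.raise_for_status()` on the `requests` response raises an exception for 4xx and 5xx answers, which is a convenient way to stop early when the server does not answer correctly.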

In [None]:
# Enter the address and see the response from the server.

url = "https://www.hispasonic.com/anuncios/teclados-sintetizadores"
page = requests.get(url)
page

#### `<Response [200]>` means the request succeeded.

## Number of pages to analyze

Once we have communication, we need to find out the **total number of pages** to scrape.

![cantidad_iteraciones.png](cantidad_iteraciones.png)

The item is identified as follows.

       'ul', class_='pagination'
       
<br>

That is, an unordered list (`ul`) with the class `pagination`.


To determine the number of iterations, that is, the number of pages from which to extract information, I must find this element inside the HTML content and read its maximum value.

We will use BeautifulSoup to extract the contents of the element.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
# soup  # all site code

Inside the `soup` variable we look for `'ul', class_='pagination'`.

The following code returns the pagination links:

- the **first 5 links of the pages**, 

- the **next 10 pages** and the **last one**, which is the one that interests us.

We save them in a list called `unordered_list`.

In [None]:
unordered_list = soup.find('ul', class_='pagination') # the pagination <ul> tag
unordered_list = unordered_list.contents # a tag's children are available as a list through .contents
unordered_list  
# list

### Exploring `unordered_list`

In [None]:
len(unordered_list) # number of elements

In [None]:
unordered_list[0] # first element

In [None]:
unordered_list[-1] # last element

In [None]:
unordered_list[-2] # this is the one I'm interested in

### Get the value number from `unordered_list`

Since what I need is the value inside the list, the strategy I will follow is this:

- Convert the list element to a text string

- Filter the characters that correspond to numeric values

- Convert those numeric characters to numbers

I cannot rely only on the value I want always being in the second-to-last position, so I will convert the content to a text string and use regular expressions to pick out the largest number.

I convert the content of the pagination element (`unordered_list[-2]`) to a text string.

In [None]:
test = str(unordered_list[-2])
test

The `extractMax` function finds every run of digits in a string, converts each one to an integer, and returns the largest.

In [None]:
def extractMax(text):
    # \d+ is a regular expression that means "one or more digits",
    # so findall returns every run of digits found in the text,
    # for example ['100', '564', '365']
    numbers = re.findall(r'\d+', text)
    # int(string) converts a string into an integer;
    # map applies int() to every element of the numbers list
    numbers = map(int, numbers)
    return max(numbers) # returns an integer

In [None]:
page_numbers = extractMax(test)
page_numbers

We already have the number of pages that we will have to analyze. 
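One caveat of taking the maximum over the whole element string is that digits inside attributes (an `href`, an id) are counted too. A minimal sketch of a stricter variant follows, assuming hypothetical markup and a hypothetical helper name, `max_page_from_html`. It strips the tags first and only looks at the visible text.

```python
import re


def max_page_from_html(fragment: str) -> int:
    """Return the largest number found in the visible text of an HTML fragment."""
    # Drop the tags first so digits inside attributes (href, ids, ...) are ignored.
    text_only = re.sub(r"<[^>]*>", " ", fragment)
    numbers = [int(n) for n in re.findall(r"\d+", text_only)]
    if not numbers:
        raise ValueError("no page numbers found")
    return max(numbers)


# Hypothetical markup, only loosely modelled on a pagination list.
sample = (
    '<ul class="pagination">'
    '<li><a href="/pagina2">2</a></li>'
    '<li><a href="/pagina57" data-id="900001">57</a></li>'
    '<li><a href="/pagina2">Siguiente</a></li>'
    '</ul>'
)

print(max_page_from_html(sample))  # 57
```

On this sample, taking the maximum over every digit run in the raw string would return 900001, the made-up `data-id`, instead of the page number.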

***

### Getting and saving all links

Iterating over each of the pages, we will extract:

- Everything that is a link.


- From those links, only the ones that contain a numeric id, which is how ad links are told apart from the rest.

In [None]:
links_ads = []        # every link (href) found on the listing pages
listado_enlaces = []  # only the links that point to ads

pattern = r"([0-9]{4,9})" # ad links contain a numeric id of 4 to 9 digits

for pagina in range(page_numbers, 0, -1): 
    url = "https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina{pagina}".format(pagina=pagina)
    print(url)
    
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    page_links = []
    for link in soup.find_all('a'):       # every <a> tag on this page
        href = link.get('href')
        if href is not None:              # skip <a> tags without an href
            page_links.append(href)
    links_ads.extend(page_links)

    for s in page_links:                  # filter only this page's links, so earlier pages are not added again
        if re.search(pattern, s):
            listado_enlaces.append(s)

In [None]:
links_ads[5:20] # example: every link found in the soup variable

In [None]:
listado_enlaces[5:20] # example: only the links that contain a numeric id
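The filtering step can be tried in isolation, without touching the network. The sketch below assumes that an ad link ends in a slash followed by a numeric id of 4 to 9 digits; `AD_LINK` and `filter_ad_links` are hypothetical names, and some of the sample links are made up for illustration.

```python
import re

# Assumption: an ad link ends in "/" followed by a numeric id of 4 to 9 digits.
AD_LINK = re.compile(r"/\d{4,9}$")


def filter_ad_links(hrefs):
    """Keep only the hrefs that look like ad links, skipping None values."""
    return [h for h in hrefs if h is not None and AD_LINK.search(h)]


sample_hrefs = [
    "/anuncios/korg-vocoder-vc10/866556",
    "/anuncios/teclados-sintetizadores/pagina2",
    None,                                  # <a> tags without an href yield None
    "/anuncios/polyend-tracker/1057403",
    "/foros",                              # hypothetical non-ad link
]

print(filter_ad_links(sample_hrefs))
```

Anchoring the pattern at the end of the string (`$`) means a link that merely contains digits somewhere in the middle is not mistaken for an ad.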

## Cleaning links

Looking at the result, it is striking that some links are repeated:

            ...
            '/anuncios/korg-vocoder-vc10/866556',
            '/anuncios/korg-vocoder-vc10/866556',
            '/anuncios/polyend-tracker/1057403',
            '/anuncios/polyend-tracker/1057403',
            '/anuncios/trajetas-teclados/949462',
            '/anuncios/trajetas-teclados/949462',
            ...



- Extract the brand name from the URL.

<br>


![regex_expression.png](regex_expression.png)

<br>


- Filter out the repeated URLs.

<br>

To keep **only unique URLs**, we filter them with a dictionary.

The main idea is to use each URL as a key, so repeated URLs collapse into a single entry, and to assign the synth brand extracted from that URL as its value.

In [2]:
os.chdir('/home/ion/Documentos/albertjimrod/personal_proj_hispasonic/htmls')

In [None]:
diccionario_enlaces = {}

listado_marcas = []

# brand regex: the lowercase letters right after "anuncios/", optionally preceded by one digit
patron_marca = r"((?<=anuncios/)[1-9][a-z]{1,})|((?<=anuncios/)[a-z]{1,})"

for enlace in listado_enlaces:
    if enlace not in diccionario_enlaces:  # keys are unique, so repeated links are skipped
        try:
            marca = re.search(patron_marca, enlace).group()
            diccionario_enlaces[enlace] = marca
        except AttributeError:
            # re.search returned None: no brand was found in this link, so it is skipped
            pass
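The same de-duplication idea can be checked on a handful of links. This is a minimal sketch, not the notebook's exact code: `BRAND` is a condensed form of `patron_marca`, `unique_links_with_brand` is a hypothetical helper, and the sample links are the repeated ones shown earlier.

```python
import re

# Condensed form of patron_marca: the lowercase letters right after
# "anuncios/", optionally preceded by a single digit from 1 to 9.
BRAND = re.compile(r"(?<=anuncios/)[1-9]?[a-z]+")


def unique_links_with_brand(links):
    """Map each distinct link to the brand found at the start of its slug."""
    result = {}
    for link in links:
        if link in result:          # already seen: dictionary keys are unique
            continue
        match = BRAND.search(link)
        if match is not None:       # links without a recognizable brand are skipped
            result[link] = match.group()
    return result


sample_links = [
    "/anuncios/korg-vocoder-vc10/866556",
    "/anuncios/korg-vocoder-vc10/866556",
    "/anuncios/polyend-tracker/1057403",
    "/anuncios/polyend-tracker/1057403",
    "/anuncios/trajetas-teclados/949462",
    "/anuncios/trajetas-teclados/949462",
]

for link, brand in unique_links_with_brand(sample_links).items():
    print(brand, link)
```

Checking the match for `None` explicitly does the same job as the `try`/`except AttributeError` above, and makes it clearer why a link can be skipped.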

With the dictionary we have just created, we are going to download all the ads locally.

The reason is to avoid overloading the server and running the risk of being banned.

### Downloading the pages locally to avoid overloading the server

In [None]:
! pwd

In [None]:
import time

main_path='https://www.hispasonic.com'
local_path = '/home/ion/Documentos/albertjimrod/personal_proj_hispasonic/htmls/'

for enlace in diccionario_enlaces:
    time.sleep(3) # sleep for 3 seconds between requests
    page = requests.get(main_path + enlace) # e.g. https://www.hispasonic.com/anuncios/polyend-tracker/1057403
    #print(main_path + enlace)
    enlace = enlace.split("/")  # split the path to extract
    enlace= enlace[2]           # the name of the ad (ads that share a name overwrite each other)

    with open(local_path + enlace + '.html',"w+") as f:
        f.write(page.text)
        
    print(local_path + enlace)
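The loop above names each file after the ad's slug alone, so two ads with the same slug would end up in the same file. One possible way around this, sketched here with a hypothetical helper called `local_filename`, is to include the numeric id in the file name. It assumes links of the form `/anuncios/<slug>/<id>`.

```python
def local_filename(enlace: str) -> str:
    """Build a file name from the slug and the numeric id of an ad link."""
    parts = enlace.strip("/").split("/")   # ['anuncios', slug, ad_id]
    if len(parts) != 3 or parts[0] != "anuncios":
        raise ValueError(f"unexpected ad link: {enlace}")
    _, slug, ad_id = parts
    return f"{slug}-{ad_id}.html"


print(local_filename("/anuncios/polyend-tracker/1057403"))
# polyend-tracker-1057403.html
```

Because the id is unique per ad, the resulting names cannot collide, and the id can later be recovered from the file name if needed.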

Next, we read each of the ads that we saved locally.
<br>

![hispa_4.png](hispa_4.png)

<br>

This is an ad in which we can identify the fields we want to get:

- **urgente**
- **compro**
- **cambio**
- **vendo**
- **price**
- **regalo**
- **busco**
- **reparar**
- **piezas**
- **marca**
- **descripcion**
- **user**
- **city**
- **published**
- **expire**
- **seen**


**Note:**

The title of the ad is filtered to extract the manufacturer's **brand** and the relevant **model** or **description**.
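To make the idea concrete before the full implementation, here is a simplified sketch of how a title can be split into action words, a brand, and the remaining description. It is not the notebook's algorithm: `parse_title`, `ACTIONS`, and `BRANDS` are hypothetical, the brand table is a tiny excerpt in the spirit of `sintes`, and this version always removes the brand from the description.

```python
import re

# Hypothetical, reduced tables for illustration only.
ACTIONS = {"compro", "cambio", "vendo", "regalo", "busco"}
BRANDS = {"korg": [], "moog": [], "modal": ["electronics"], "make": ["noise"]}


def parse_title(title: str) -> dict:
    """Split an ad title into action words, brand and remaining description."""
    # Replace the punctuation signs with spaces, lowercase, and split into words.
    words = re.sub(r"[:()/.]", " ", title).lower().split()
    actions = [w for w in words if w in ACTIONS]
    brand = "-"
    rest = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in ACTIONS:
            i += 1
            continue
        if brand == "-" and word in BRANDS:
            second_words = BRANDS[word]
            # Two-word brand: the next word must be one of the listed second parts.
            if i + 1 < len(words) and words[i + 1] in second_words:
                brand = word + " " + words[i + 1]
                i += 2
                continue
            # One-word brand: no second part is expected.
            if not second_words:
                brand = word
                i += 1
                continue
        rest.append(word)
        i += 1
    return {"actions": actions, "brand": brand, "description": " ".join(rest)}


print(parse_title("Vendo: Modal Electronics Cobalt 5S"))
```

Returning a dictionary per ad, instead of appending to many parallel lists, keeps every field of an ad together and makes it harder for the columns to fall out of step.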

In [3]:
accion = ["compro","cambio","vendo","regalo","busco","busca",'reparar','piezas']

In [4]:
sintes = {'000': ['000'], '2hp': ['2hp'], '4ms': ['4ms'], 'acces': ['acces'], 'access': ['access'], 'acidlab': ['acidlab'], 'acl': ['acl'], 'akai': ['akai'], 'alembic': ['alembic'], 'alesis': ['alesis'], 'allen&heath': ['allen&heath'], 'analogaudio1': ['analogaudio1'], 'arp': ['arp'], 'arturia': ['arturia'], 'asm': ['asm'], 'atomosynth': ['atomosynth'], 'axoloty': ['axoloty'], 'balaguer': ['balaguer'], 'baloran': ['baloran'], 'befaco': ['befaco'], 'behringer': ['behringer'], 'bitbox': ['bitbox'], 'boss': ['boss'], 'bubblesound': ['bubblesound', 'instruments'], 'buchla': ['buchla'], 'böhm': ['böhm'], 'casio': ['casio'], 'charvel': ['charvel'], 'chronograf': ['chronograf'], 'clavia': ['clavia', 'electro', 'lead', '3', '4', 'micro', 'modular', 'rack', 'stage', 'wave'], 'coast': ['coast'], 'corsynth': ['corsynth'], 'cosmotronic': ['cosmotronic'], 'cre8audio': ['cre8audio'], 'crumar': ['crumar'], 'cyclone': ['cyclone', 'analogic'], 'deepmind': ['deepmind', '12', '6'], 'delptronics': ['delptronics'], 'dexibell': ['dexibell'], 'digitack': ['digitack'], 'divkid': ['divkid'], 'doepfer': ['doepfer'], 'dreadbox': ['dreadbox'], 'dubreq': ['dubreq'], 'dynacord': ['dynacord'], 'e-mu': ['e-mu'], 'e-rm': ['e-rm'], 'e:m:c': ['e:m:c'], 'electribe': ['electribe'], 'electrosmith': ['electrosmith'], 'electrovoice': ['electrovoice'], 'elektron': ['elektron'], 'elka': ['elka'], 'emc': ['emc'], 'emu': ['emu'], 'endorphin.es': ['endorphin.es'], 'endorphines': ['endorphines'], 'ensoniq': ['ensoniq'], 'eowave': ['eowave'], 'epiphone': ['epiphone'], 'eurorack': ['eurorack'], 'eventide': ['eventide'], 'evh': ['evh'], 'evolver': ['evolver'], 'farfisa': ['farfisa'], 'fender': ['fender'], 'fishman': ['fishman'], 'five12': ['five12'], 'fodera': ['fodera'], 'formanta': ['formanta'], 'fretlight': ['fretlight'], 'friedman': ['friedman'], 'futuresonus': ['futuresonus'], 'gator': ['gator'], 'gemini': ['gemini'], 'generalmusic': ['generalmusic'], 'gibson': ['gibson'], 'gieskes': ['gieskes'], 'godin': 
['godin'], 'gotharman': ['gotharman'], 'grayscale': ['grayscale'], 'gretsch': ['gretsch'], 'guild': ['guild'], 'hammond': ['hammond'], 'hartmann': ['hartmann'], 'hexinverter': ['hexinverter'], 'hofner': ['hofner'], 'hypersynth': ['hypersynth'], 'höfner': ['höfner'], 'ibanez': ['ibanez'], 'ik': ['ik'], 'instruo': ['instruo'], 'intellijel': ['intellijel', 'designs'], 'iomega': ['iomega'], 'isla': ['isla'], 'jackson': ['jackson'], 'jaspers': ['jaspers'], 'jomox': ['jomox'], 'joranalogue': ['joranalogue'], 'kawai': ['kawai'], 'kenton': ['kenton'], 'ketron': ['ketron'], 'klavis': ['klavis'], 'knobula': ['knobula'], 'komplete': ['komplete'], 'korg': ['korg'], 'kramer': ['kramer'], 'kurzweil': ['kurzweil'], 'l-1': ['l-1'], 'lakland': ['lakland'], 'livid': ['livid'], 'lmntl': ['lmntl'], 'm-audio': ['m-audio'], 'make': ['make', 'noise'], 'malekko': ['malekko', 'heavy', 'industry'], 'maschine': ['maschine'], 'mellotron': ['mellotron'], 'mfb': ['mfb'], 'miditech': ['miditech'], 'modal': ['modal', 'electronics'], 'models': ['models'], 'modor': ['modor'], 'modular': ['modular'], 'modulus': ['modulus'], 'monome': ['monome'], 'moog': ['moog'], 'mordax': ['mordax'], 'mosaic': ['mosaic'], 'mpc': ['mpc'], 'mrseri': ['mrseri'], 'mutant': ['mutant'], 'neutron': ['neutron'], 'nord': ['nord', 'electro', 'lead', '3', '4', 'micro', 'modular', 'rack', 'stage', 'wave'], 'novation': ['novation'], 'numark': ['numark'], 'oberheim': ['oberheim'], 'octatrack': ['octatrack'], 'paratek': ['paratek'], 'pearl': ['pearl'], 'peavey': ['peavey'], 'percussa': ['percussa'], 'polyend': ['polyend'], 'polygraf': ['polygraf'], 'prs': ['prs'], 'qu-bit': ['qu-bit', 'electronix'], 'quasimidi': ['quasimidi'], 'qubit': ['qubit'], 'quiklok': ['quiklok'], 'rhodes': ['rhodes'], 'rickenbacker': ['rickenbacker'], 'roland': ['roland'], 'roli': ['roli'], 'rossum': ['rossum'], 'sanson': ['sanson'], 'schecter': ['schecter'], 'sensel': ['sensel'], 'sequentix': ['sequentix'], 'shakmat': ['shakmat', 'modular'], 'simmons': 
['simmons'], 'soma': ['soma'], 'sonicware': ['sonicware'], 'soundforce': ['soundforce'], 'soundmachines': ['soundmachines'], 'spector': ['spector'], 'sputnik': ['sputnik'], 'squarp': ['squarp'], 'squier': ['squier'], 'ssff': ['ssff'], 'stanton': ['stanton'], 'steinberger': ['steinberger'], 'sterling': ['sterling'], 'strymon': ['strymon'], 'studiologic': ['studiologic', 'music'], 'supercritical': ['supercritical'], 'swissonic': ['swissonic'], 'synamodec': ['synamodec'], 'synthrotek': ['synthrotek'], 'synthstrom': ['synthstrom'], 'synthtech': ['synthtech'], 'tascam': ['tascam'], 'taylor': ['taylor'], 'technos': ['technos'], 'transient': ['transient'], 'trogotronic': ['trogotronic'], 'tubbutec': ['tubbutec'], 'u-he': ['u-he'], 'vermona': ['vermona'], 'virus': ['virus'], 'viscount': ['viscount'], 'voicas': ['voicas'], 'volca': ['volca'], 'vox': ['vox'], 'vpme.de': ['vpme.de'], 'waldorf': ['waldorf'], 'warwick': ['warwick'], 'washburn': ['washburn'], 'wurlitzer': ['wurlitzer'], 'yamaha': ['yamaha'], 'yocto': ['yocto'], 'zoom': ['zoom'], '0': ['coast'], '1010': ['music'], 'a-v-p': ['synth'], 'acid': ['rain'], 'addac': ['system'], 'after': ['later'], 'aion': ['modular'], 'ajh': ['synth'], 'allen': ['&'], 'alm': ['busy'], 'alright': ['devices'], 'analogue': ['solutions', 'systems'], 'atomo': ['synth'], 'audio': ['damage'], 'audiophile': ['circuits'], 'bastl': ['instruments'], 'black': ['corporation'], 'blackhole': ['cases'], 'blue': ['lantern'], 'boredbrain': ['music'], 'charlie': ['lab'], 'circuit': ['abbey'], 'club': ['of'], 'custom': ['made'], 'dave': ['jones', 'smith', 'instruments'], 'delta': ['music'], 'denon': ['dj'], 'dnipro': ['modular'], 'e': ['mu'], 'elby': ['designs'], 'electronic': ['music'], 'emblematic': ['systems'], 'empress': ['effects'], 'erica': ['synths'], 'ernie': ['ball'], 'erogenous': ['tones'], 'eskatonic': ['modular'], 'esp': ['ltd'], 'exodus': ['digital'], 'frap': ['tool', 'tools'], 'frequency': ['central'], 'future': ['retro', 'sound'], 'graph': 
['tech'], 'hinton': ['instruments'], 'industrial': ['music'], 'io': ['instruments'], 'john': ['bowen'], 'kilpatrick': ['audio'], 'koma': ['elektronik'], 'line': ['6'], 'linn': ['electronics'], 'logan': ['electronics'], 'low-gain': ['electronics'], 'lzx': ['industries'], 'macbeth': ['studio'], 'manhattan': ['analog'], 'manikin': ['electronic'], 'meng': ['qi'], 'michigan': ['synth'], 'micro': ['modular'], 'modbap': ['modular'], 'mutable': ['instruments'], 'nano': ['modules'], 'native': ['instruments'], 'noise': ['engineering'], 'orthogonal': ['devices'], 'patching': ['panda'], 'pioneer': ['dj'], 'pittsburgh': ['modular'], 'plankton': ['electronics'], 'poly': ['effects'], 'ppg': ['ppg'], 'qu': ['bit'], 'radikal': ['technologies'], 'random': ['source'], 'ritual': ['electronics'], 'schlappi': ['engineering'], 'sequential': ['circuits'], 'special': ['waves'], 'spectral': ['audio'], 'steady': ['state'], 'studio': ['electronics'], 'synthesis': ['technology'], 'system': ['80'], 'tall': ['dog'], 'tasty': ['chips'], 'teenage': ['engineering'], 'tenderfoot': ['electronics'], 'tesseract': ['modular'], 'tiptop': ['audio'], 'traveler': ['guitar'], 'udo': ['audio'], 'uno': ['synth'], 'verbos': ['electronics'], 'waves': ['grendel'], 'wersi': ['music'], 'winter': ['modular'], 'wmd': ['ssf'], 'worng': ['electronics'], 'xaoc': ['devices'], 'xor': ['electronics'], 'zeppelin': ['design'], 'zlob': ['modular']}

In [5]:
%%time

compare = ''


texto_descriptivo = ''
list_temp = []

list_compro = []
list_cambio = []
list_vendo = []
list_regalo = []
list_busco = []
list_rebaja = []

list_reparar = []
list_piezas = []
list_urgente = []
list_oferta = []

list_brand = []
list_descripcion = []    
texto_descriptivo_salida = []                                      # this is what will be shown as the content of the ad

list_price = []
list_user = []
list_city = []
list_published = []
list_expire = []
list_times_seen= [] 
list_date = []
list_dates = []

lista_palabras_para_eliminar = []

def func_compro(clave_func_dict):

    if list_compro[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_compro.pop(-1)
        list_compro.append("1")
    else:
        pass


def func_cambio(clave_func_dict):
    if list_cambio[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_cambio.pop(-1)
        list_cambio.append("1")

    else:
        pass

def func_vendo(clave_func_dict):
    if list_vendo[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("1")
    else:
        pass

def func_regalo(clave_func_dict):
    if list_regalo[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_regalo.pop(-1)
        list_regalo.append("1")
    else:
        pass

def func_busco(clave_func_dict):
    if list_busco[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_busco.pop(-1)
        list_busco.append("1")
    else:
        pass

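def func_reparar(clave_func_dict):
    if list_busco[-1] == "0":        # the guard is on list_busco: flag only while the busco flag is still "0"
        list_reparar.pop(-1)
        list_reparar.append("1")
    else:
        pass

def func_piezas(clave_func_dict):
    if list_busco[-1] == "0":        # same guard on list_busco as in func_reparar
        list_piezas.pop(-1)
        list_piezas.append("1")
    else:
        pass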

def func_rebaja(clave_func_dict):
    if list_rebaja[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_rebaja.pop(-1)
        list_rebaja.append("1")
    else:
        pass

def func_oferta(clave_func_dict):
    if list_oferta[-1] == "0":
        list_oferta.pop(-1)
        list_oferta.append("1")
    else:
        pass



func_dict = {
    "compro":func_compro,
    "cambio":func_cambio,
    "vendo":func_vendo,
    "vende":func_vendo,
    "regalo":func_regalo,
    "busco":func_busco,
    "busca":func_busco,
    "reparar":func_reparar,
    "piezas":func_piezas,
    "rebajado":func_rebaja,
    "rebaja":func_rebaja,
    "oferta":func_oferta
    
}

def remove_compro(clave_func_dict):
    #list_compro.append(clave_func_dict
    list_compro.remove(clave_func_dict)


#rmv_func = {"compro":remove_compro}


def urgente():
    list_urgente[-1] = "1"   # flag the current ad (the last entry) as urgent


def eliminar_signos(txt):
    descripcion = txt.replace(":"," ")
    descripcion_1 = descripcion.replace("("," ")
    descripcion_2 = descripcion_1.replace(")"," ")
    descripcion_3 = descripcion_2.replace("/"," ")
    descripcion_4 = descripcion_3.replace("."," ").lower()
    descripcion_5 = descripcion_4.split()                   # here the string descripcion_4 becomes a list of words, descripcion_5
    return descripcion_5


def default_atributes():
    list_cambio.append("0")
    list_compro.append("0")
    list_urgente.append("0")
    list_vendo.append("1")
    list_regalo.append("0")
    list_reparar.append("0")
    list_piezas.append("0")
    list_busco.append("0")
    list_brand.append("-")


marca_del_sinte = ''

### Start


for pagina_anuncio in os.listdir('.'):                                  # '.' is the current directory, which points to htmls
    #if "kurzweil" in pagina_anuncio:                               #compro-cambio
    
    with open(pagina_anuncio, 'r') as pagina_bruto:

        pagina_analizar = pagina_bruto.read()
        soup = BeautifulSoup(pagina_analizar, 'html.parser')


        node = soup.find('h1') 

    if  node is not None:                                           # this avoids an error when no <h1> is found (None)
        descripcion = node.text 
        descripcion = eliminar_signos(descripcion)
        #print(descripcion)

        default_atributes()


        for word_1 in descripcion:
            if word_1 in accion:
                func_dict[word_1](word_1)
                lista_palabras_para_eliminar.append(word_1)

            elif word_1 in compare:
                list_temp.append(word_1)

                for marca_sinte in list_temp:
                    marca_del_sinte += marca_sinte + ' '
                    lista_palabras_para_eliminar.append(marca_sinte) # this removes the compound brand name from the description

                #marca_del_sinte = ''
                list_brand.pop(-1)
                list_brand.append(marca_del_sinte)

                compare = '' # this variable holds the second part of the brand name, so it is cleared here

            elif word_1 in sintes:
                size_brand = len(sintes[word_1])

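                # note: list_brand is a list, so `list_brand != "-"` is always True
                # and `list_brand == "-"` is always False
                if ((size_brand == 1) and (list_brand != "-")) :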
                    list_brand.pop(-1)
                    #lista_palabras_para_eliminar.append(word_1)
                    list_brand.append(word_1)
                    #print(word_1)

                elif ((size_brand == 1) and (list_brand == "-")) :
                    list_descripcion.append(word_1)

                elif size_brand > 1:
                    compare = sintes[word_1]
                    list_temp.append(word_1)

                elif list_brand != "-":
                    list_descripcion.append(word_1)

            marca_del_sinte = ''

        list_temp.clear()

        duplicates = [element for element in lista_palabras_para_eliminar if lista_palabras_para_eliminar.count(element) > 1]
        unique_duplicates = list(set(duplicates))
        #unique_duplicates = ''.join(unique_duplicates)

        #size_unique_duplicates = len(unique_duplicates)
        size_unique_duplicates = len(duplicates)

        if size_unique_duplicates > 1:
            urgente()


        for eliminar in lista_palabras_para_eliminar:
            descripcion.remove(eliminar)

        for palabras in descripcion:
            texto_descriptivo += palabras + ' '


        texto_descriptivo_salida.append(texto_descriptivo)

        #print(f"urgente: {list_urgente[-1]} compro: {list_compro[-1]} cambio: {list_cambio[-1]} vendo: {list_vendo[-1]}  marca: ###{list_brand[-1]} vendo:{list_vendo[-1]} regalo:{list_regalo[-1]} busco: {list_busco[-1]} reparar:{list_reparar[-1]} piezas:{list_piezas[-1]} descripcion: {texto_descriptivo_salida[-1]}")

        #print("\t","\t",texto_descriptivo)
        texto_descriptivo =''


        # --- price

        try:
            price = soup.find('div',class_='ad-price').text
            price = int(price.split()[0])
            list_price.append(price)
        except (AttributeError, IndexError, ValueError):   # no price element, empty text, or a non-numeric price
            price = 0
            list_price.append(price)

        # --- user name

        user = soup.find('div',class_='col-lg-7').a.text
        list_user.append(user)

        # --- city

        city = soup.find('div',class_='col-lg-7').div.strong.text
        list_city.append(city)

        # --- published

        #fecha = soup.find('div', class_='miniicon miniicon-date')
        #published = fecha

        published = soup.find('div',class_='col-lg-7').div.text.split()[-5:-2]
        #published = ' '.join(published)
        #list_published.append(published[2])

        publish = None                      # stays None if no date is recognized, instead of reusing the previous ad's value
        for indx in published:
            if '/' in indx:                 # absolute date, e.g. 08/05/2022
                print(indx)
                publish = indx
            elif 'hace' in indx:            # relative date: "hace" followed by an amount and a unit, e.g. 8 horas
                a = published.index(indx)
                print(published[a+1] + ' ' + published[a+2])
                publish = published[a+1] + ' ' + published[a+2]
                # published[a] + ' ' + 
            
        list_published.append(publish)

        # --- expire 

        expire = soup.find('div',class_="expira").text.split()[1]
        list_expire.append(expire)

        # --- times seen
        seen = soup.find('div',class_="expira").text.split()[4]
        list_times_seen.append(seen)

        lista_palabras_para_eliminar.clear()
        #print(pagina_anuncio)

08/05/2022
25/07/2022
10/07/2021
01/07/2022
8 horas
03/01/2022
04/07/2022
1 semana
19/07/2022
22/06/2021
25/02/2022
01/04/2021
30/12/2021
22/07/2022
5 días
3 semanas
09/07/2022
20/09/2020
08/06/2022
10/04/2022
09/10/2021
2 días
21/07/2022
09/03/2022
11/06/2022
2 semanas
3 días
30/03/2022
15/02/2021
17/05/2022
09/03/2022
05/06/2022
3 días
23/03/2022
2 semanas
3 semanas
3 horas
17/04/2020
07/02/2022
15/06/2022
19/07/2022
2 semanas
05/05/2022
14/06/2022
02/01/2022
6 días
30/10/2020
1 semana
19/05/2022
07/01/2022
17/02/2021
1 semana
7 días
25/07/2022
19/06/2022
02/06/2022
10/09/2017
07/04/2022
14/07/2022
11/07/2022
4 días
25/05/2022
4 días
07/07/2022
25/07/2022
28/06/2021
07/07/2022
05/07/2022
29/03/2021
10/06/2022
2 días
16/07/2022
2 semanas
1 semana
6 días
11/06/2022
04/07/2022
12/02/2022
20/06/2022
14/02/2022
2 semanas
24/07/2021
03/09/2021
1 semana
19/06/2019
28/05/2022
11/04/2022
23/12/2021
21/07/2022
22/07/2022
19/06/2022
1 semana
09/06/2021
05/02/2022
24/07/2022
3 semanas
2 días
6 d

### Data extraction

The next step is to record the date on which the data extraction is done. This is important because, when doing the analysis, it lets us keep a record of the ads over time.

In [6]:
hoy = dt.datetime.now()
year=str(hoy.year)

month=str(hoy.month)
day=str(hoy.day)

#date_scrapped = day + '/' + month + '/' + year   # today's date, built from the values above


date_scrapped = '26' + '/' + '08' + '/' + '2022'  # fixed date used for this run
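As an alternative to joining day, month and year by hand, the date can be formatted in one step with `strftime`, which also zero-pads single-digit days and months so the result matches the `dd/mm/yyyy` dates found in the ads. The helper name `format_scrape_date` is hypothetical.

```python
import datetime as dt


def format_scrape_date(day: dt.date) -> str:
    """Format a date as dd/mm/yyyy, zero-padding the day and the month."""
    return day.strftime("%d/%m/%Y")


print(format_scrape_date(dt.date(2022, 8, 26)))  # 26/08/2022
```

For the current day, the call would be `format_scrape_date(dt.date.today())`.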

In [7]:
df = pd.DataFrame({'urgent':list_urgente,
                   'buy':list_compro,
                   'change':list_cambio,
                   'sell':list_vendo,
                   'price':list_price,
                   'gift':list_regalo,
                   'search':list_busco,
                   'repair':list_reparar,
                   'parts':list_piezas,
                   'synt_brand':list_brand,
                   'description':texto_descriptivo_salida,
                   'user':list_user,
                   'city':list_city,
                   'published':list_published,
                   'expire':list_expire,
                   'date_scrapped':date_scrapped,
                   'seen':list_times_seen
                  },index = list(range(1,len(texto_descriptivo_salida)+1)))

#print('Dataframa:\n', df)


df

| |urgent|buy|change|sell|price|gift|search|repair|parts|synt_brand|description|user|city|published|expire|date_scrapped|seen|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|1|0|0|0|1|350|0|0|0|0|dreadbox|dreadbox nyx v1|selimorgan|Valencia|08/05/2022|23/09/2022|26/08/2022|668|
|2|0|0|0|1|350|0|0|0|0|modal electronics|cobalt 5s|Luis Martín (Roma 18)|Castellón|25/07/2022|23/09/2022|26/08/2022|312|
|3|0|0|0|1|900|0|0|0|0|moog|moog little phatty stage 2|Gergö|Baleares|10/07/2021|28/09/2022|26/08/2022|548|
|4|0|0|0|1|420|0|0|0|0|doepfer|último precio doepfer a-100 6u|manuelariza|Málaga|01/07/2022|30/08/2022|26/08/2022|272|
|5|0|0|0|1|480|0|0|0|0|mpc|mpc live ssd 250 envío incluido|djoy|Albacete|8 horas|25/10/2022|26/08/2022|78|
|6|0|0|0|1|105|0|0|0|0|synthrotek|synthrotek fold|moog1406|Girona|03/01/2022|20/09/2022|26/08/2022|478|
|7|0|0|0|1|1300|0|0|0|0|roland|roland fantom g6|NELUQ|Cuenca|04/07/2022|18/09/2022|26/08/2022|520|
|8|0|0|1|0|0|0|0|0|0|kurzweil|kurzweil k2500x|chinog|Madrid|1 semana|17/10/2022|26/08/2022|256|
|9|0|0|0|1|80|0|0|0|0|alesis|alesis qx49 - controlador midi|SINTEXIA|Córdoba|19/07/2022|17/09/2022|26/08/2022|192|
|10|0|0|0|1|500|0|0|0|0|roland|roland mc-909 sampling groovebox|Rammsysounds|Barcelona|22/06/2021|22/09/2022|26/08/2022|2439|


In [8]:
semanas = ['1 semana', '2 semanas', '3 semanas', '4 semanas']
dias = ['1 día', '2 días', '3 días', '4 días', '5 días', '6 días', '7 días']
horas = ['1 hora','2 horas', '3 horas', '4 horas', '5 horas', '6 horas',
        '7 horas','8 horas', '9 horas', '10 horas', '11 horas', '12 horas',
        '13 horas', '14 horas','15 horas', '16 horas', '17 horas', '18 horas',
        '19 horas', '20 horas', '21 horas', '22 horas','23 horas', '24 horas']

In [9]:
# '1 minuto', '2 minutos', ..., '60 minutos'
minutes = [f"{mint} minuto" if mint < 2 else f"{mint} minutos" for mint in range(1, 61)]

In [10]:
copia_df = df.copy()

In [11]:
def alehop(parameter):

    days_inweek = 7

    #date_scrapped = '26' + '/' + '08' + '/' + '2022' 
    
    current_datetime = dt.datetime.strptime(date_scrapped,"%d/%m/%Y") 
    
    
    if parameter in semanas:
        
        num_semana = parameter.split()
        num_semana = int(num_semana[0])
        cambio_semana = semanas[num_semana-1]
        
        dias_semana = (num_semana * days_inweek)
        
        
        fecha_real_semana = current_datetime - dt.timedelta(dias_semana)
        
        fecha_real_semana = fecha_real_semana.strftime("%d/%m/%Y")
        
                
        df['published'] = df['published'].replace( to_replace = cambio_semana, value = fecha_real_semana) #+ ' semana'
        
    elif parameter in dias:
        num_dia = parameter.split()
        num_dias = int(num_dia[0])
        cambio_dia = dias[num_dias-1]

        fecha_real_dia = current_datetime - dt.timedelta(num_dias)
        fecha_real_dia = fecha_real_dia.strftime("%d/%m/%Y")
        
        df['published'] = df['published'].replace( to_replace = cambio_dia, value = fecha_real_dia) #+ ' semana'
        
        
    elif parameter in horas:
        num_hora = parameter.split()
        num_hora = int(num_hora[0])
        #print(num_hora)
        
        if (parameter != '24 horas'):
            hora_real = current_datetime
            hora_real = hora_real.strftime("%d/%m/%Y")
            
            #print(hora_real)
            
            df['published'] = df['published'].replace(to_replace = parameter,
                                              value = hora_real)
        
        elif parameter == '24 horas':
            horas_24 = 1
            hora_real = current_datetime - dt.timedelta(horas_24)
            hora_real = hora_real.strftime("%d/%m/%Y")
            
            
            
            df['published'] = df['published'].replace( to_replace = parameter,value = hora_real ) #+ ' semana'
    
    
    elif parameter in minutes:
        # published only minutes ago: same day as the scrape
        hora_real = current_datetime.strftime("%d/%m/%Y")
            
        df['published'] = df['published'].replace( to_replace = parameter,value = hora_real )
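The same conversion can also be written as a pure function that takes one value and returns the converted date, without modifying `df` from the inside. The sketch below is an alternative under stated assumptions, not the notebook's method. `to_absolute_date` and `RELATIVE` are hypothetical names, and the reference date defaults to the `26/08/2022` used in this run. Weeks and days are subtracted, `24 horas` counts as the previous day, and shorter spans, including minutes, are treated as the same day.

```python
import datetime as dt
import re

# An amount followed by a Spanish time unit, e.g. "1 semana", "5 días", "8 horas".
RELATIVE = re.compile(r"^(\d+)\s+(minutos?|horas?|días?|semanas?)$")


def to_absolute_date(value: str, scraped: str = "26/08/2022") -> str:
    """Turn a relative age such as '2 semanas' into a dd/mm/yyyy date."""
    match = RELATIVE.match(value.strip())
    if match is None:
        return value                      # already an absolute date: leave it as is
    amount, unit = int(match.group(1)), match.group(2)
    reference = dt.datetime.strptime(scraped, "%d/%m/%Y")
    if unit.startswith("semana"):
        delta = dt.timedelta(weeks=amount)
    elif unit.startswith("día"):
        delta = dt.timedelta(days=amount)
    elif unit.startswith("hora"):
        delta = dt.timedelta(days=amount // 24)   # only whole days are subtracted
    else:
        delta = dt.timedelta(0)                   # minutes: same day (assumption)
    return (reference - delta).strftime("%d/%m/%Y")


for example in ["08/05/2022", "8 horas", "1 semana", "5 días"]:
    print(example, "->", to_absolute_date(example))
```

With pandas, a function like this could be applied as `df['published'] = df['published'].map(to_absolute_date)`, which returns the converted column instead of a column of `None` values.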

In [None]:
df['published'] = copia_df['published'].copy()

In [12]:
df['published']

1      08/05/2022
2      25/07/2022
3      10/07/2021
4      01/07/2022
5         8 horas
6      03/01/2022
7      04/07/2022
8        1 semana
9      19/07/2022
10     22/06/2021
11     25/02/2022
12     01/04/2021
13     30/12/2021
14     22/07/2022
15     22/07/2022
16         5 días
17      3 semanas
18     09/07/2022
19     20/09/2020
20     08/06/2022
21     10/04/2022
22     09/10/2021
23         2 días
24     21/07/2022
25     09/03/2022
26     11/06/2022
27      2 semanas
28         3 días
29     30/03/2022
30     15/02/2021
31     17/05/2022
32     09/03/2022
33     05/06/2022
34         3 días
35     23/03/2022
36      2 semanas
37      3 semanas
38        3 horas
39     17/04/2020
40     07/02/2022
41     15/06/2022
42     19/07/2022
43      2 semanas
44     05/05/2022
45     14/06/2022
46     02/01/2022
47         6 días
48     30/10/2020
49       1 semana
50     19/05/2022
51     07/01/2022
52     17/02/2021
53     17/02/2021
54       1 semana
55         7 días
56     25/

In [16]:
df['published'] = pd.to_datetime(df['published'],dayfirst=True)
df['published']

1     2022-05-08
2     2022-07-25
3     2021-07-10
4     2022-07-01
5     2022-08-26
6     2022-01-03
7     2022-07-04
8     2022-08-19
9     2022-07-19
10    2021-06-22
11    2022-02-25
12    2021-04-01
13    2021-12-30
14    2022-07-22
15    2022-07-22
16    2022-08-21
17    2022-08-05
18    2022-07-09
19    2020-09-20
20    2022-06-08
21    2022-04-10
22    2021-10-09
23    2022-08-24
24    2022-07-21
25    2022-03-09
26    2022-06-11
27    2022-08-12
28    2022-08-23
29    2022-03-30
30    2021-02-15
31    2022-05-17
32    2022-03-09
33    2022-06-05
34    2022-08-23
35    2022-03-23
36    2022-08-12
37    2022-08-05
38    2022-08-26
39    2020-04-17
40    2022-02-07
41    2022-06-15
42    2022-07-19
43    2022-08-12
44    2022-05-05
45    2022-06-14
46    2022-01-02
47    2022-08-20
48    2020-10-30
49    2022-08-19
50    2022-05-19
51    2022-01-07
52    2021-02-17
53    2021-02-17
54    2022-08-19
55    2022-08-19
56    2022-07-25
57    2022-06-19
58    2022-06-02
59    2017-09-
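
`dayfirst=True` lets pandas infer the date format. A stricter variant, sketched here on made-up sample values, passes an explicit `format` together with `errors="coerce"`, so any label that was not converted beforehand (for example a leftover `3 días`) becomes `NaT` and can be listed rather than raising an error or being misread.

```python
import pandas as pd

# Illustrative values only, not taken from the scraped data
sample = pd.Series(["08/05/2022", "25/07/2022", "3 días"])

# Values that do not match dd/mm/YYYY are turned into NaT
parsed = pd.to_datetime(sample, format="%d/%m/%Y", errors="coerce")

# Show the original strings that failed to parse
leftover = sample[parsed.isna()]
print(parsed)
print(leftover.tolist())
```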

In [13]:
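# alehop returns None for every value, so the Series printed below is all None;
# any conversion it performs happens as a side effect, not through the return value
df['published'].apply(alehop)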

1      None
2      None
3      None
4      None
5      None
6      None
7      None
8      None
9      None
10     None
11     None
12     None
13     None
14     None
15     None
16     None
17     None
18     None
19     None
20     None
21     None
22     None
23     None
24     None
25     None
26     None
27     None
28     None
29     None
30     None
31     None
32     None
33     None
34     None
35     None
36     None
37     None
38     None
39     None
40     None
41     None
42     None
43     None
44     None
45     None
46     None
47     None
48     None
49     None
50     None
51     None
52     None
53     None
54     None
55     None
56     None
57     None
58     None
59     None
60     None
61     None
62     None
63     None
64     None
65     None
66     None
67     None
68     None
69     None
70     None
71     None
72     None
73     None
74     None
75     None
76     None
77     None
78     None
79     None
80     None
81     None
82     None
83     None
84  

In [None]:
df['published'].unique()
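
`unique()` lists every distinct value, absolute dates included. To see only the labels that still need converting, one possible sketch filters out everything that already looks like `dd/mm/YYYY`. The sample values are illustrative.

```python
import pandas as pd

# Illustrative values only
sample = pd.Series(["08/05/2022", "8 horas", "26/08/2022", "2 semanas"])

# True where the value already has the dd/mm/YYYY shape
is_date = sample.str.match(r"^\d{2}/\d{2}/\d{4}$")

# Distinct labels that are still relative
pending = sample[~is_date].unique().tolist()
print(pending)
```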

In [17]:
pd.set_option("display.max_columns", None)

In [14]:
df['published']

1      08/05/2022
2      25/07/2022
3      10/07/2021
4      01/07/2022
5      26/08/2022
6      03/01/2022
7      04/07/2022
8      19/08/2022
9      19/07/2022
10     22/06/2021
11     25/02/2022
12     01/04/2021
13     30/12/2021
14     22/07/2022
15     22/07/2022
16     21/08/2022
17     05/08/2022
18     09/07/2022
19     20/09/2020
20     08/06/2022
21     10/04/2022
22     09/10/2021
23     24/08/2022
24     21/07/2022
25     09/03/2022
26     11/06/2022
27     12/08/2022
28     23/08/2022
29     30/03/2022
30     15/02/2021
31     17/05/2022
32     09/03/2022
33     05/06/2022
34     23/08/2022
35     23/03/2022
36     12/08/2022
37     05/08/2022
38     26/08/2022
39     17/04/2020
40     07/02/2022
41     15/06/2022
42     19/07/2022
43     12/08/2022
44     05/05/2022
45     14/06/2022
46     02/01/2022
47     20/08/2022
48     30/10/2020
49     19/08/2022
50     19/05/2022
51     07/01/2022
52     17/02/2021
53     17/02/2021
54     19/08/2022
55     19/08/2022
56     25/

In [18]:
df

|Unnamed: 0|urgent|buy|change|sell|price|gift|search|repair|parts|synt_brand|description|user|city|published|expire|date_scrapped|seen|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|1|0|0|0|1|350|0|0|0|0|dreadbox|dreadbox nyx v1|selimorgan|Valencia|2022-05-08|23/09/2022|26/08/2022|668|
|2|0|0|0|1|350|0|0|0|0|modal electronics|cobalt 5s|Luis Martín (Roma 18)|Castellón|2022-07-25|23/09/2022|26/08/2022|312|
|3|0|0|0|1|900|0|0|0|0|moog|moog little phatty stage 2|Gergö|Baleares|2021-07-10|28/09/2022|26/08/2022|548|
|4|0|0|0|1|420|0|0|0|0|doepfer|último precio doepfer a-100 6u|manuelariza|Málaga|2022-07-01|30/08/2022|26/08/2022|272|
|5|0|0|0|1|480|0|0|0|0|mpc|mpc live ssd 250 envío incluido|djoy|Albacete|2022-08-26|25/10/2022|26/08/2022|78|
|6|0|0|0|1|105|0|0|0|0|synthrotek|synthrotek fold|moog1406|Girona|2022-01-03|20/09/2022|26/08/2022|478|
|7|0|0|0|1|1300|0|0|0|0|roland|roland fantom g6|NELUQ|Cuenca|2022-07-04|18/09/2022|26/08/2022|520|
|8|0|0|1|0|0|0|0|0|0|kurzweil|kurzweil k2500x|chinog|Madrid|2022-08-19|17/10/2022|26/08/2022|256|
|9|0|0|0|1|80|0|0|0|0|alesis|alesis qx49 - controlador midi|SINTEXIA|Córdoba|2022-07-19|17/09/2022|26/08/2022|192|
|10|0|0|0|1|500|0|0|0|0|roland|roland mc-909 sampling groovebox|Rammsysounds|Barcelona|2021-06-22|22/09/2022|26/08/2022|2439|


In [None]:
df.to_csv('fixed_feliz_cumple_dataframe.csv', index = True)
#df.to_csv('hispa_dataframe.csv'.format(date=date_scrapped), index = False)
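
If the scrape date is meant to go into the file name, a date written as `26/08/2022` cannot be used directly, because `/` would be read as a path separator. The sketch below uses a made-up two-row frame and a hypothetical file name to show one way of building a dated name and reading the file back. With `index=True` the index is written as an unnamed first column, which shows up as `Unnamed: 0` on reload unless `index_col=0` is passed.

```python
import datetime as dt
import os
import tempfile
import pandas as pd

# Made-up frame standing in for the scraped data
df_demo = pd.DataFrame(
    {"price": [350, 900], "published": pd.to_datetime(["2022-05-08", "2021-07-10"])},
    index=[1, 2],
)

# ISO-style stamp: safe in file names and sorts chronologically
stamp = dt.date(2022, 8, 26).strftime("%Y-%m-%d")
path = os.path.join(tempfile.gettempdir(), f"hispa_dataframe_{stamp}.csv")  # hypothetical name

df_demo.to_csv(path, index=True)

# index_col=0 restores the index; parse_dates restores the datetime column
restored = pd.read_csv(path, index_col=0, parse_dates=["published"])
print(restored.dtypes)
```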

In [None]:
! pwd