# Hispasonic


<br>

Hispasonic is a website about musical instruments, recording gear, and everything related to the world of music. It also hosts a second-hand market where users sell their musical instruments.

This first part of the project focuses on obtaining relevant ad information. The category I have focused on is electronic musical instruments.

<br>

Before we start obtaining information, we first need to understand how the ads page is organized.

***

- *Image of one of the pages of hispasonic*


![hispa_1e.png](hispa_1e.png)

<br>

<br>

We can see several important things:

- The selected category is "teclados y sintetizadores" (keyboards and synthesizers).

- We need to know the number of pages we are going to analyze to get **all the ads**.



## Function library loading

In [1]:
import requests               # Is an elegant and simple HTTP library for Python
from bs4 import BeautifulSoup # library for pulling data out of HTML and XML files
import re                     # regular expressions operations
import pandas as pd           # A fast, powerful, flexible and easy to use open source data analysis tool
import os                     # A versatile way to use operating system-dependent functionality.
import datetime as dt         # module for manipulating dates and times.
import time

pd.set_option("display.max_rows", None)

### First contact

First of all, we need to know whether we get a proper response from the server.

In [2]:
%%html 
<style>
table {float:left}
</style>

These are the main possible answers we can get from the server:

|||
|:--|:--|
|**1xx informational response –** |the request was received, continuing process|
|**2xx successful –** |the request was successfully received, understood, and accepted|
|**3xx redirection –** |further action needs to be taken in order to complete the request|
|**4xx client error –** |the request contains bad syntax or cannot be fulfilled|
|**5xx server error –** |the server failed to fulfil an apparently valid request|
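
As a small illustration of the table above, the following sketch maps a status code to its class. The helper name `status_class` is hypothetical and not part of `requests`; the response object itself exposes the numeric code as `page.status_code`.

```python
def status_class(status_code):
    """Return the response class for an HTTP status code (hypothetical helper)."""
    classes = {
        1: "informational response",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # The first digit of the code identifies its class, e.g. 404 -> 4
    return classes.get(status_code // 100, "unknown")


print(status_class(200))
print(status_class(404))
```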

In [None]:
# Enter the address and see the response from the server.

url = "https://www.hispasonic.com/anuncios/teclados-sintetizadores"
page = requests.get(url)
page

#### `<Response [200]>` means the connection is correct.

## Number of pages to analyze

Once we have communication, we need to determine the **total number of pages** to scrape.

![cantidad_iteraciones.png](cantidad_iteraciones.png)

The item is identified as follows.

       'ul', class_='pagination'
       
<br>

That is, an unordered list (`ul`) with the class `pagination`.


To determine the number of iterations, that is, the number of pages from which to extract information, I need to find this element in the HTML content and read its maximum value.

We will use BeautifulSoup to extract the contents of that element.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
# soup  # all site code

Inside the `soup` variable we look for `'ul', class_='pagination'`.

The following code retrieves:

- the links to the **first 5 pages**,

- the links to the **next 10 pages** and to the **last one**, which is the one that interests us.

We save the result in a list called `unordered_list`.

In [None]:
unordered_list = soup.find('ul', class_='pagination') # the pagination <ul> tag
unordered_list = unordered_list.contents # a tag's children are available as a list through .contents
unordered_list                           # list

### Exploring `unordered_list`

In [None]:
len(unordered_list) # number of elements

In [None]:
unordered_list[0] # first element

In [None]:
unordered_list[-1] # last element

In [None]:
unordered_list[-2] # this is the one I'm interested in

### Get the value number from `unordered_list`

Since what I need is the value inside the list, the strategy is the following:

- Convert the list to a text string

- Filter the characters that correspond to numeric values

- Convert those numeric characters to numbers

I cannot rely on the value I want always being in the second-to-last position, so I will convert the contents of the list to a text string and use regular expressions to pick out the largest numeric value.

I convert the contents of `paginas` to a text string.

In [None]:
test = str(unordered_list[-2])
test

The `extractMax` function finds the runs of digits in the string, converts them to integers, and returns the largest one.

In [None]:
def extractMax(text):
    # \d+ is a regular expression that means "one or more digits",
    # so findall returns every run of digits in the string,
    # e.g. ['100', '564', '365']
    numbers = re.findall(r'\d+', text)
    # int(string) converts a string into an integer;
    # map applies int() to every element of the list
    numbers = map(int, numbers)
    return max(numbers) # returns an integer

In [None]:
page_numbers = extractMax(test)
page_numbers

We already have the number of pages that we will have to analyze. 
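
A minimal sketch of the same idea on a made-up pagination fragment, using only the standard library. The HTML string and the helper name `max_page_from_link_text` are assumptions for illustration; the real markup on the site may differ. Applying `\d+` to the whole element also picks up digits inside `href` attributes, so this variant only looks at numbers that appear as visible link text.

```python
import re

# Hypothetical pagination fragment; the real markup may differ
sample = (
    '<li><a href="/anuncios/teclados-sintetizadores/pagina2">2</a></li>'
    '<li><a href="/anuncios/teclados-sintetizadores/pagina37">37</a></li>'
)

def max_page_from_link_text(html):
    # Capture only digits that sit between '>' and '<', i.e. the visible link text
    numbers = re.findall(r'>(\d+)<', html)
    return max(map(int, numbers))

print(max_page_from_link_text(sample))
```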

***

### Getting and saving all links

Iterating over each of the pages, we will:

- Extract everything that is a link.


- Keep only the links that end in a number, which is how ads are told apart from everything else.
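
The filtering step can be sketched on a handful of made-up `href` values. The list `hrefs` is hypothetical; `None` stands for an `<a>` tag without an `href`, which `link.get('href')` returns and which would make `re.search` raise a `TypeError` if it were not skipped.

```python
import re

pattern = r"([0-9]{4,9})"  # same pattern as in the notebook: a run of 4 to 9 digits

# Hypothetical hrefs as returned by link.get('href')
hrefs = [
    "/anuncios/polyend-tracker/1057403",
    "/anuncios/teclados-sintetizadores/pagina2",
    None,
    "/foros",
    "/anuncios/korg-vocoder-vc10/866556",
]

# Keep only non-empty hrefs that contain an ad id
ad_links = [h for h in hrefs if h and re.search(pattern, h)]
print(ad_links)
```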

In [None]:
links_ads = []        # every link found on the pages
listado_enlaces = []  # only the links that point to ads

pattern = r"([0-9]{4,9})" # ad links contain a numeric id of 4 to 9 digits

for pagina in range(page_numbers, 0, -1): 
    url = "https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina{pagina}".format(pagina=pagina)
    print(url)
    
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    for link in soup.find_all('a'):       # collect every link in the page
        links_ads.append(link.get('href'))

for s in links_ads:                       # keep only the links that contain an ad id
    if s and re.search(pattern, s):       # skip <a> tags without an href (None)
        listado_enlaces.append(s)

In [None]:
links_ads[5:20] # example: a slice of all the links collected

In [None]:
listado_enlaces[5:20] # example: a slice of the links kept as ads

## Cleaning links

Looking at the output, it is striking that some links are repeated:

            '...
            '/anuncios/korg-vocoder-vc10/866556',
             '/anuncios/korg-vocoder-vc10/866556',
             '/anuncios/polyend-tracker/1057403',
             '/anuncios/polyend-tracker/1057403',
             '/anuncios/trajetas-teclados/949462',
             '/anuncios/trajetas-teclados/949462',
                                             ...',



- Extract the brand name from the URL.

<br>


![regex_expression.png](regex_expression.png)

<br>


- Filter out the repeated URLs.

<br>

To keep **only unique URLs**, we filter them with a dictionary.

The main idea is to use each URL as a key, so repeated URLs collapse into a single entry, and to assign the synth brand for that URL as its value.
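
A minimal sketch of this idea on a few of the repeated links shown above. The list `links` and the name `unique` are assumptions for illustration; the brand pattern is the one used in the notebook, written as a raw string.

```python
import re

# Brand pattern from the notebook, written as a raw string
patron_marca = r"((?<=anuncios\/)[1-9][a-z]{1,})|((?<=anuncios\/)[a-z]{1,})"

links = [
    "/anuncios/korg-vocoder-vc10/866556",
    "/anuncios/korg-vocoder-vc10/866556",
    "/anuncios/polyend-tracker/1057403",
    "/anuncios/polyend-tracker/1057403",
]

unique = {}
for link in links:
    match = re.search(patron_marca, link)
    if match:                         # links without a recognisable brand are skipped
        unique[link] = match.group()  # assigning to an existing key adds no new entry

print(unique)
```

Because dictionary keys are unique, the four links collapse into two entries, each mapped to the first word of its ad slug.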

In [None]:
os.chdir('/home/ion/Documentos/albertjimrod/personal_proj_hispasonic/htmls')

In [None]:
diccionario_enlaces = {}

listado_marcas = []

patron_marca = r"((?<=anuncios\/)[1-9][a-z]{1,})|((?<=anuncios\/)[a-z]{1,})" # filter brand regex

for enlace in listado_enlaces:
    if enlace not in diccionario_enlaces:  
        try:
            marca = re.search(patron_marca, enlace).group()
            diccionario_enlaces[enlace] = marca
        except AttributeError:
            # re.search returned None: no brand could be extracted, so the link is skipped
            pass

With the dictionary we have just created, we are going to download all the ads locally.

This avoids overloading the server with repeated requests and running the risk of being banned.

In [None]:
! pwd # check the current working directory

## Downloading the pages locally to avoid overloading the server

In [None]:
import time

main_path='https://www.hispasonic.com'
local_path = '/home/ion/Documentos/albertjimrod/personal_proj_hispasonic/htmls'

for enlace in diccionario_enlaces:
    time.sleep(2) # sleep for 2 seconds between requests
    page = requests.get(main_path + enlace) # https://www.hispasonic.com/anuncios/polyend-tracker/1057403.html
    
    enlace = enlace.split("/")  # split the link to extract
    enlace= enlace[2]           # the name of the ad

    # local_path has no trailing separator, so file names get the "htmls" prefix
    with open(local_path + enlace + '.html',"w+") as f:
        f.write(page.text)
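
The file names in the listing below start with `htmls` because the directory and the ad name are joined with plain string concatenation, without a path separator. The following sketch contrasts that with `os.path.join`; the helper name `local_filename` and the relative directory `"htmls"` are assumptions for illustration.

```python
import os

def local_filename(enlace, local_path):
    # '/anuncios/polyend-tracker/1057403' -> ['', 'anuncios', 'polyend-tracker', '1057403']
    nombre = enlace.split("/")[2]
    # os.path.join inserts the separator that plain '+' concatenation leaves out
    return os.path.join(local_path, nombre + ".html")

enlace = "/anuncios/polyend-tracker/1057403"
print("htmls" + enlace.split("/")[2] + ".html")   # plain concatenation
print(local_filename(enlace, "htmls"))            # joined path
```

Note that in both variants the file name is built only from the ad slug, so two ads with the same slug would be written to the same file.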

In [5]:
os.getcwd() # checking path is set properly

'/home/ion/Documentos/albertjimrod/personal_proj_hispasonic/htmls'

In [6]:
! ls

htmls4ms-quad-pingable-lfoquad-clock-distributor.html
htmls4ms-row-power.html
htmls4ms-stereo-triggered-sampler.html
htmlsableton-push-primera-version.html
htmlsaccess-matrix-programmer.html
htmlsaccess-virus-kc.html
htmlsaccess-virus-snow.html
htmlsaccess-virus-ti-2-polar.html
htmlsacces-virus-c.html
htmlsacidlab-drumatix.html
htmlsaddac-804-audio-integrator.html
htmlsakai-advance-49.html
htmlsakai-cd-rom-sound-library-mpc2000xl-vol1.html
htmlsakai-eb-16-eb16-fx-card-2000-xl.html
htmlsakai-force.html
htmlsakai-lpd-8.html
htmlsakai-mpc-live-2.html
htmlsakai-mpc-live.html
htmlsakai-mpc-one.html
htmlsakai-mpc-renaissance.html
htmlsakai-mpk61.html
htmlsakai-mpk-88.html
htmlsakai-mpk88.html
htmlsakai-pro-mpd-32.html
htmlsakai-s3000xl.html
htmlsakai-s3000xl-tarjeta-eb16-zip-omega-manuales-cableado.html
htmlsakai-s6000.html
htmlsakai-sg01v.html
htmlsakai-x7000.html
htmlsakai-xe8-midi-drum-expander.html
htmlsakemie-castle-alm-busy-circuits.html
htmlsalesis-ion.h

Next, we read each of the ads that we saved locally.
<br>

![hispa_4.png](hispa_4.png)

<br>

This is an ad where you can identify the fields we want to get:

- **urgente**
- **compro**
- **cambio**
- **vendo**
- **price**
- **regalo**
- **busco**
- **reparar**
- **piezas**
- **marca**
- **descripcion**
- **user**
- **city**
- **published**
- **expire**
- **seen**


**Note:**

The title of the ad is filtered to extract the manufacturer's brand and the relevant **model** or **description**.
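
The idea can be sketched on a much smaller scale. This is a simplified illustration, not the notebook's actual procedure: the names `ACTIONS`, `BRANDS` and `parse_title` are hypothetical, the lookup tables are a tiny made-up subset, and unlike the notebook code the sketch keeps only the first brand it finds and removes it from the description.

```python
# Small made-up subset of the lookup tables; the real ones are much larger
ACTIONS = {"compro", "cambio", "vendo", "regalo", "busco"}
BRANDS = {"korg": ["korg"], "moog": ["moog"], "make": ["make", "noise"]}

def parse_title(title):
    """Split an ad title into action words, brand and remaining description."""
    for sign in ":()/.":
        title = title.replace(sign, " ")
    words = title.lower().split()

    actions, brand, rest = [], "-", []
    i = 0
    while i < len(words):
        word = words[i]
        nxt = words[i + 1] if i + 1 < len(words) else None
        if word in ACTIONS:
            actions.append(word)
        elif brand == "-" and word in BRANDS and nxt in BRANDS[word]:
            brand = word + " " + nxt      # two-word brand such as "make noise"
            i += 1                        # the second word is consumed as well
        elif brand == "-" and word in BRANDS and BRANDS[word] == [word]:
            brand = word                  # single-word brand
        else:
            rest.append(word)
        i += 1
    return actions, brand, " ".join(rest)

print(parse_title("Vendo Make Noise Wogglebug"))
print(parse_title("Moog Subsequent 37 (sub37)"))
```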

In [None]:
os.chdir('/home/ion/Documentos/albertjimrod/personal_proj_hispasonic/htmls')

In [None]:
! ls

In [7]:
accion = ["compro","cambio","vendo","regalo","busco","busca",'reparar','piezas']

#accesorios = ["maleta","flightcase","sonidos","rack","enracado","enrackado","racks"]

#instrumento = ['sintetizador','piano','teclado','modulo']
                

sintes = {'000': ['000'], '2hp': ['2hp'], '4ms': ['4ms'], 'acces': ['acces'], 'access': ['access'], 'acidlab': ['acidlab'], 'acl': ['acl'], 'akai': ['akai'], 'alembic': ['alembic'], 'alesis': ['alesis'], 'allen&heath': ['allen&heath'], 'analogaudio1': ['analogaudio1'], 'arp': ['arp'], 'arturia': ['arturia'], 'asm': ['asm'], 'atomosynth': ['atomosynth'], 'axoloty': ['axoloty'], 'balaguer': ['balaguer'], 'baloran': ['baloran'], 'befaco': ['befaco'], 'behringer': ['behringer'], 'bitbox': ['bitbox'], 'boss': ['boss'], 'bubblesound': ['bubblesound', 'instruments'], 'buchla': ['buchla'], 'böhm': ['böhm'], 'casio': ['casio'], 'charvel': ['charvel'], 'chronograf': ['chronograf'], 'clavia': ['clavia', 'electro', 'lead', '3', '4', 'micro', 'modular', 'rack', 'stage', 'wave'], 'coast': ['coast'], 'corsynth': ['corsynth'], 'cosmotronic': ['cosmotronic'], 'cre8audio': ['cre8audio'], 'crumar': ['crumar'], 'cyclone': ['cyclone', 'analogic'], 'deepmind': ['deepmind', '12', '6'], 'delptronics': ['delptronics'], 'dexibell': ['dexibell'], 'digitack': ['digitack'], 'divkid': ['divkid'], 'doepfer': ['doepfer'], 'dreadbox': ['dreadbox'], 'dubreq': ['dubreq'], 'dynacord': ['dynacord'], 'e-mu': ['e-mu'], 'e-rm': ['e-rm'], 'e:m:c': ['e:m:c'], 'electribe': ['electribe'], 'electrosmith': ['electrosmith'], 'electrovoice': ['electrovoice'], 'elektron': ['elektron'], 'elka': ['elka'], 'emc': ['emc'], 'emu': ['emu'], 'endorphin.es': ['endorphin.es'], 'endorphines': ['endorphines'], 'ensoniq': ['ensoniq'], 'eowave': ['eowave'], 'epiphone': ['epiphone'], 'eurorack': ['eurorack'], 'eventide': ['eventide'], 'evh': ['evh'], 'evolver': ['evolver'], 'farfisa': ['farfisa'], 'fender': ['fender'], 'fishman': ['fishman'], 'five12': ['five12'], 'fodera': ['fodera'], 'formanta': ['formanta'], 'fretlight': ['fretlight'], 'friedman': ['friedman'], 'futuresonus': ['futuresonus'], 'gator': ['gator'], 'gemini': ['gemini'], 'generalmusic': ['generalmusic'], 'gibson': ['gibson'], 'gieskes': ['gieskes'], 'godin': 
['godin'], 'gotharman': ['gotharman'], 'grayscale': ['grayscale'], 'gretsch': ['gretsch'], 'guild': ['guild'], 'hammond': ['hammond'], 'hartmann': ['hartmann'], 'hexinverter': ['hexinverter'], 'hofner': ['hofner'], 'hypersynth': ['hypersynth'], 'höfner': ['höfner'], 'ibanez': ['ibanez'], 'ik': ['ik'], 'instruo': ['instruo'], 'intellijel': ['intellijel', 'designs'], 'iomega': ['iomega'], 'isla': ['isla'], 'jackson': ['jackson'], 'jaspers': ['jaspers'], 'jomox': ['jomox'], 'joranalogue': ['joranalogue'], 'kawai': ['kawai'], 'kenton': ['kenton'], 'ketron': ['ketron'], 'klavis': ['klavis'], 'knobula': ['knobula'], 'komplete': ['komplete'], 'korg': ['korg'], 'kramer': ['kramer'], 'kurzweil': ['kurzweil'], 'l-1': ['l-1'], 'lakland': ['lakland'], 'livid': ['livid'], 'lmntl': ['lmntl'], 'm-audio': ['m-audio'], 'make': ['make', 'noise'], 'malekko': ['malekko', 'heavy', 'industry'], 'maschine': ['maschine'], 'mellotron': ['mellotron'], 'mfb': ['mfb'], 'miditech': ['miditech'], 'modal': ['modal', 'electronics'], 'models': ['models'], 'modor': ['modor'], 'modular': ['modular'], 'modulus': ['modulus'], 'monome': ['monome'], 'moog': ['moog'], 'mordax': ['mordax'], 'mosaic': ['mosaic'], 'mpc': ['mpc'], 'mrseri': ['mrseri'], 'mutant': ['mutant'], 'neutron': ['neutron'], 'nord': ['nord', 'electro', 'lead', '3', '4', 'micro', 'modular', 'rack', 'stage', 'wave'], 'novation': ['novation'], 'numark': ['numark'], 'oberheim': ['oberheim'], 'octatrack': ['octatrack'], 'paratek': ['paratek'], 'pearl': ['pearl'], 'peavey': ['peavey'], 'percussa': ['percussa'], 'polyend': ['polyend'], 'polygraf': ['polygraf'], 'prs': ['prs'], 'qu-bit': ['qu-bit', 'electronix'], 'quasimidi': ['quasimidi'], 'qubit': ['qubit'], 'quiklok': ['quiklok'], 'rhodes': ['rhodes'], 'rickenbacker': ['rickenbacker'], 'roland': ['roland'], 'roli': ['roli'], 'rossum': ['rossum'], 'sanson': ['sanson'], 'schecter': ['schecter'], 'sensel': ['sensel'], 'sequentix': ['sequentix'], 'shakmat': ['shakmat', 'modular'], 'simmons': 
['simmons'], 'soma': ['soma'], 'sonicware': ['sonicware'], 'soundforce': ['soundforce'], 'soundmachines': ['soundmachines'], 'spector': ['spector'], 'sputnik': ['sputnik'], 'squarp': ['squarp'], 'squier': ['squier'], 'ssff': ['ssff'], 'stanton': ['stanton'], 'steinberger': ['steinberger'], 'sterling': ['sterling'], 'strymon': ['strymon'], 'studiologic': ['studiologic', 'music'], 'supercritical': ['supercritical'], 'swissonic': ['swissonic'], 'synamodec': ['synamodec'], 'synthrotek': ['synthrotek'], 'synthstrom': ['synthstrom'], 'synthtech': ['synthtech'], 'tascam': ['tascam'], 'taylor': ['taylor'], 'technos': ['technos'], 'transient': ['transient'], 'trogotronic': ['trogotronic'], 'tubbutec': ['tubbutec'], 'u-he': ['u-he'], 'vermona': ['vermona'], 'virus': ['virus'], 'viscount': ['viscount'], 'voicas': ['voicas'], 'volca': ['volca'], 'vox': ['vox'], 'vpme.de': ['vpme.de'], 'waldorf': ['waldorf'], 'warwick': ['warwick'], 'washburn': ['washburn'], 'wurlitzer': ['wurlitzer'], 'yamaha': ['yamaha'], 'yocto': ['yocto'], 'zoom': ['zoom'], '0': ['coast'], '1010': ['music'], 'a-v-p': ['synth'], 'acid': ['rain'], 'addac': ['system'], 'after': ['later'], 'aion': ['modular'], 'ajh': ['synth'], 'allen': ['&'], 'alm': ['busy'], 'alright': ['devices'], 'analogue': ['solutions', 'systems'], 'atomo': ['synth'], 'audio': ['damage'], 'audiophile': ['circuits'], 'bastl': ['instruments'], 'black': ['corporation'], 'blackhole': ['cases'], 'blue': ['lantern'], 'boredbrain': ['music'], 'charlie': ['lab'], 'circuit': ['abbey'], 'club': ['of'], 'custom': ['made'], 'dave': ['jones', 'smith', 'instruments'], 'delta': ['music'], 'denon': ['dj'], 'dnipro': ['modular'], 'e': ['mu'], 'elby': ['designs'], 'electronic': ['music'], 'emblematic': ['systems'], 'empress': ['effects'], 'erica': ['synths'], 'ernie': ['ball'], 'erogenous': ['tones'], 'eskatonic': ['modular'], 'esp': ['ltd'], 'exodus': ['digital'], 'frap': ['tool', 'tools'], 'frequency': ['central'], 'future': ['retro', 'sound'], 'graph': 
['tech'], 'hinton': ['instruments'], 'industrial': ['music'], 'io': ['instruments'], 'john': ['bowen'], 'kilpatrick': ['audio'], 'koma': ['elektronik'], 'line': ['6'], 'linn': ['electronics'], 'logan': ['electronics'], 'low-gain': ['electronics'], 'lzx': ['industries'], 'macbeth': ['studio'], 'manhattan': ['analog'], 'manikin': ['electronic'], 'meng': ['qi'], 'michigan': ['synth'], 'micro': ['modular'], 'modbap': ['modular'], 'mutable': ['instruments'], 'nano': ['modules'], 'native': ['instruments'], 'noise': ['engineering'], 'orthogonal': ['devices'], 'patching': ['panda'], 'pioneer': ['dj'], 'pittsburgh': ['modular'], 'plankton': ['electronics'], 'poly': ['effects'], 'ppg': ['ppg'], 'qu': ['bit'], 'radikal': ['technologies'], 'random': ['source'], 'ritual': ['electronics'], 'schlappi': ['engineering'], 'sequential': ['circuits'], 'special': ['waves'], 'spectral': ['audio'], 'steady': ['state'], 'studio': ['electronics'], 'synthesis': ['technology'], 'system': ['80'], 'tall': ['dog'], 'tasty': ['chips'], 'teenage': ['engineering'], 'tenderfoot': ['electronics'], 'tesseract': ['modular'], 'tiptop': ['audio'], 'traveler': ['guitar'], 'udo': ['audio'], 'uno': ['synth'], 'verbos': ['electronics'], 'waves': ['grendel'], 'wersi': ['music'], 'winter': ['modular'], 'wmd': ['ssf'], 'worng': ['electronics'], 'xaoc': ['devices'], 'xor': ['electronics'], 'zeppelin': ['design'], 'zlob': ['modular']}

compare = ''
texto_marca_compuesta =''
texto_descriptivo = ''
list_temp = []

list_compro = []
list_cambio = []
list_vendo = []
list_regalo = []
list_busco = []
list_rebaja = []

list_reparar = []
list_piezas = []
list_urgente = []
list_oferta = []

list_brand = []
list_descripcion = []    
texto_descriptivo_salida = []                                      # this is what will be shown as the content of the ad

list_price = []
list_user = []
list_city = []
list_published = []
list_expire = []
list_times_seen= [] 
list_date = []
list_dates = []

lista_palabras_para_eliminar = []



def func_compro(clave_func_dict):

    if list_compro[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_compro.pop(-1)
        list_compro.append("1")
    else:
        pass


def func_cambio(clave_func_dict):
    if list_cambio[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_cambio.pop(-1)
        list_cambio.append("1")

    else:
        pass

def func_vendo(clave_func_dict):
    if list_vendo[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("1")
    else:
        pass

def func_regalo(clave_func_dict):
    if list_regalo[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_regalo.pop(-1)
        list_regalo.append("1")
    else:
        pass

def func_busco(clave_func_dict):
    if list_busco[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_busco.pop(-1)
        list_busco.append("1")
    else:
        pass

def func_reparar(clave_func_dict):
    if list_reparar[-1] == "0":
        list_reparar.pop(-1)
        list_reparar.append("1")
    else:
        pass

def func_piezas(clave_func_dict):
    if list_piezas[-1] == "0":
        list_piezas.pop(-1)
        list_piezas.append("1")
    else:
        pass

def func_rebaja(clave_func_dict):
    if list_rebaja[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_rebaja.pop(-1)
        list_rebaja.append("1")
    else:
        pass

def func_oferta(clave_func_dict):
    if list_oferta[-1] == "0":
        list_oferta.pop(-1)
        list_oferta.append("1")
    else:
        pass



func_dict = {
    "compro":func_compro,
    "cambio":func_cambio,
    "vendo":func_vendo,
    "vende":func_vendo,
    "regalo":func_regalo,
    "busco":func_busco,
    "busca":func_busco,
    "reparar":func_reparar,
    "piezas":func_piezas,
    "rebajado":func_rebaja,
    "rebaja":func_rebaja,
    "oferta":func_oferta
    
}

def remove_compro(clave_func_dict):
    #list_compro.append(clave_func_dict
    list_compro.remove(clave_func_dict)


rmv_func = {"compro":remove_compro
}


def urgente():
    list_urgente.remove('0')
    list_urgente.append("1")



def eliminar_signos(txt):
    descripcion = txt.replace(":"," ")
    descripcion_1 = descripcion.replace("("," ")
    descripcion_2 = descripcion_1.replace(")"," ")
    descripcion_3 = descripcion_2.replace("/"," ")
    descripcion_4 = descripcion_3.replace("."," ").lower()
    descripcion_5 = descripcion_4.split()                   # here descripcion_4 (a string) becomes a list, descripcion_5
    return descripcion_5

marca_del_sinte = ''

### Start


for pagina_anuncio in os.listdir('.'):                                  # '.' refers to the current directory, htmls
    #if "" in pagina_anuncio:                               #compro-cambio
    
    with open(pagina_anuncio, 'r') as pagina_bruto:

        pagina_analizar = pagina_bruto.read()
        soup = BeautifulSoup(pagina_analizar, 'html.parser')

        node = soup.find('h1') 

    if  node is not None:                                           # this avoids an error when no <h1> is found (None)
        descripcion = node.text 
        descripcion = eliminar_signos(descripcion)
        #print(descripcion)

        list_cambio.append("0")
        list_compro.append("0")
        list_urgente.append("0")
        list_vendo.append("1")
        list_regalo.append("0")
        list_reparar.append("0")
        list_piezas.append("0")
        list_busco.append("0")

        list_brand.append("-")

        
        for word_1 in descripcion:
            if word_1 in accion:
                func_dict[word_1](word_1)
                lista_palabras_para_eliminar.append(word_1)
            
            elif word_1 in compare:
                list_temp.append(word_1)
            
                for marca_sinte in list_temp:
                    marca_del_sinte += marca_sinte + ' '
                    lista_palabras_para_eliminar.append(marca_sinte) # this removes the compound synth brand name from the description.

                #marca_del_sinte = ''
                list_brand.pop(-1)
                list_brand.append(marca_del_sinte)

                compare = '' # this variable would hold the second part of the name, so it is cleared here.

            elif word_1 in sintes:
                size_brand = len(sintes[word_1])
            
                if ((size_brand == 1) and (list_brand != "-")) :
                    list_brand.pop(-1)
                    #lista_palabras_para_eliminar.append(word_1)
                    list_brand.append(word_1)
                    #print(word_1)

                elif ((size_brand == 1) and (list_brand == "-")) :
                    list_descripcion.append(word_1)

                elif size_brand > 1:
                    compare = sintes[word_1]
                    list_temp.append(word_1)

                elif list_brand != "-":
                    list_descripcion.append(word_1)

            marca_del_sinte = ''

        list_temp.clear()

        duplicates = [element for element in lista_palabras_para_eliminar if lista_palabras_para_eliminar.count(element) > 1]
        unique_duplicates = list(set(duplicates))
        #unique_duplicates = ''.join(unique_duplicates)

        #size_unique_duplicates = len(unique_duplicates)
        size_unique_duplicates = len(duplicates)

        if size_unique_duplicates > 1:
            urgente()


        for eliminar in lista_palabras_para_eliminar:
            descripcion.remove(eliminar)

        for palabras in descripcion:
            texto_descriptivo += palabras + ' '


        texto_descriptivo_salida.append(texto_descriptivo)

        #print(f"urgente: {list_urgente[-1]} compro: {list_compro[-1]} cambio: {list_cambio[-1]} vendo: {list_vendo[-1]}  marca: ###{list_brand[-1]} vendo:{list_vendo[-1]} regalo:{list_regalo[-1]} busco: {list_busco[-1]} reparar:{list_reparar[-1]} piezas:{list_piezas[-1]} descripcion: {texto_descriptivo_salida[-1]}")

        #print("\t","\t",texto_descriptivo)
        texto_descriptivo =''


        # --- price

        try:
            price = soup.find('div',class_='ad-price').text
            price = int(price.split()[0])
            list_price.append(price)
        except (AttributeError, IndexError, ValueError):   # no price element, empty text, or non-numeric price
            price = 0
            list_price.append(price)



        # --- user name

        user = soup.find('div',class_='col-lg-7').a.text
        list_user.append(user)

        # --- city

        city = soup.find('div',class_='col-lg-7').div.strong.text
        list_city.append(city)

        # --- published

        published = soup.find('div',class_='col-lg-7').div.text.split()[3]
        list_published.append(published)

        # --- expire 

        expire = soup.find('div',class_="expira").text.split()[1]
        list_expire.append(expire)

        # --- times seen
        seen = soup.find('div',class_="expira").text.split()[4]
        list_times_seen.append(seen)

        lista_palabras_para_eliminar.clear()
        print(pagina_anuncio)





htmlsteclado-elka-omb5-caja-ritmosacompanamiento-automatico.html
htmlsjen-sx-2000-synthetone.html
htmlsdiskette-sonidos-sampler-korg-dsm1.html
htmlsgroovebox-roland-d2.html
htmlsmoog-subsequent-37-sub37.html
htmlsalesis-micron.html
htmlskorg-volca-bass.html
htmlsleslie-122-clasico-autonomo.html
htmlsaccess-matrix-programmer.html
htmlskit-controlador-leslies-6-pines-122-147-familias-6h-6w.html
htmlssintetizador-monofonico-korg-ms-10-70s-korg-sq1-midi.html
htmlsdoepfer-110-1-vco.html
htmlssoporte-teclado.html
htmlsyamaha-dx7-breath-controler-bc1.html
htmlsmoog-werkstatt-01-cv-expander.html
htmlssintetizador-analogico.html
htmlsroland-fantom-s-61.html
htmlsop-1-2021-segunda-generacion.html
htmlsvende-king-korg.html
htmlslote-manuales-emu-korg-roland-kawai.html
htmlskurzweil-forte-88-teclas.html
htmlsxaoc-moskwa-ii-ostankino-ii.html
htmlskorg-wavestate.html
htmlskorg-sd-400-signal-delay.html
htmlsparatek-neon-black.html
htmlsferrofish-a32-ad-da-converter.html
htmlsyamaha-motif-xf-x-maleta.

htmlscambio-modulos-eurorack.html
htmlsleslie-815-proline.html
htmlssoma-lyra-8.html
htmlsnovation-summit.html
htmlsbehringer-crave.html
htmlsdoepfer-192-2.html
htmlsmoog-system-15.html
htmlslibros-desarrollo-plugins-audio.html
htmlstiptop-audio-hats808.html
htmlsaccess-virus-snow.html
htmlsbefaco-muxlicer-mex-modulos.html
htmlsyamaha-rx-17-caja-ritmos-desktop-digital-posterior-rx11-rx15.html
htmlssinte-korg-ms-20-mini-nuevo.html
htmlsemu-proteus-1-sintetizador-rack.html
htmlsyamaha-dx7ii-d.html
htmlscasio-vl-tone-vl-1.html
htmlskurzweil-pc3-le7.html
htmlskorg-01-wfd.html
htmlskorg-ms-20-mini-modificado-pwm-sync.html
htmlsmodulos-eurorack.html
htmlsvirus-indigo-2.html
htmlsstudiologic-sl88-grand.html
htmlsantimatter-audio-crossfold.html
htmlspiano-digital-yamaha-p140.html
htmlsclavia-nord-lead-1-rack-12-voces.html
htmlskurzweil-k2000rs-estropeado.html
htmlskorg-arp-odyssey-modulo-rev3.html
htmlssintetizador-yamaha-mo6.html
htmlsvendo-cacharros-varios.html
htmlsfaltar-studio-880-sl.html

htmlsroland-tr-505.html
htmlsmu-orbit-3.html
htmlsteenage-enginering-decksaver.html
htmlskurzweil-pc3a8-funda-ruedas.html
htmlsmutable-instruments-ambika.html
htmlsstand-akai-mpc-live-1-fuera-stock-env-incl.html
htmlseurorack-loquelic-iteritas-percido.html
htmlsarturia-keylab-49-mkii-black.html
htmlscaja-ritmos-yamaha-rx7.html
htmlsmulti-efectos-the-dust-collector-finegear.html
htmlshammond-262-1968.html
htmlskorg-radias-r.html
htmlselectribe-esx-1sd.html
htmlsamplificador-teclados-peavey-kb-100.html
htmlserica-synths-black-output-v2.html
htmlsfrequency-central-bartos-flur-ii.html
htmlssound-labs-raagini-digital-electronic-tanpura.html
htmlsclavia-nord-electro-6d-73.html
htmlsmutable-instruments-midigal-midipal.html
htmlsbehringer-apr-2500-varios.html
htmlsmutable-instruments-clouds-completo.html
htmlspiano-electrico-vertical-thomann-dp-51.html
htmlsroyal-1-pianos-para-yamaha-motif-xs-motif-xf-moxf-montage.html
htmlskurzweil-k2000.html
htmlsdsi-prophet-08-libreria-kontakt-reason.html
h

htmlsmodal-argon8.html
htmlsroland-mv30.html
htmlskorg-trident-mk-i.html
htmlskorg-ms20-mini.html
htmlsteclado-casio-vintage-sk-1.html
htmlskorg-karma.html
htmlsbehringer-deepmind-12.html
htmlselektron-digitakt.html
htmlsteclado-yamaha-motif-es8-88-teclas-contrapesadas.html
htmlspolyend-tracker.html
htmlsroland-xv-5080.html
htmlsop-lab.html
htmlsteclado-maestro-behringer-umx490.html
htmlsdoepfer-100lc3v-low-cost-case-vintage-edition.html
htmlscassette-datos-korg-dw6000.html
htmlsnord-electro-hp-5.html
htmlscasiotone-701.html
htmlssecuenciador-analogue-systems-rs-200.html
htmlsteclado-korg-ds-8-1987.html
htmlskorg-01w-fd.html
htmlsmake-noise-wogglebug.html
htmlshammond-m100-ano-1962.html
htmlsroland-mc-505-groovebox.html
htmlsteclado-ketron-audya-5.html
htmlsxaoc-odesza-hel-expander.html
htmlsyamaha-psr-3000-libreria-kontakt.html
htmlskorg-pa4x-61.html
htmlsnord-lead-3.html
htmlssonicware-elz_1-sintetizador-portatil.html
htmlsdetachment-3-archangel.html
htmlsik-multimedia-synth-pro-desk

### Data extraction

The next step is to record the date on which the data extraction is done. This is important because it lets us keep a record of the ads over time, which can be useful during the analysis.

In [8]:
hoy = dt.datetime.now()
year=str(hoy.year)

month=str(hoy.month)
day=str(hoy.day)

date_scrapped = day + '/' + month + '/' + year

date_scrapped

'25/8/2022'
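
Building the date from `day`, `month` and `year` gives an unpadded string such as `'25/8/2022'`, while the dates taken from the ads are zero-padded (for example `02/09/2022`). As a hedged alternative, `strftime` or `isoformat` produce consistent formats; the variable names below are assumptions for illustration, and a fixed date is used so the result is reproducible.

```python
import datetime as dt

fixed = dt.datetime(2022, 8, 25)   # fixed date so the example is reproducible

# Same construction as in the notebook: no zero padding
unpadded = str(fixed.day) + '/' + str(fixed.month) + '/' + str(fixed.year)

padded = fixed.strftime('%d/%m/%Y')   # zero-padded day and month
iso = fixed.date().isoformat()        # ISO 8601, sorts correctly as plain text

print(unpadded, padded, iso)
```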

In [9]:
df = pd.DataFrame({'urgent':list_urgente,
                   'buy':list_compro,
                   'change':list_cambio,
                   'sell':list_vendo,
                   'price':list_price,
                   'gift':list_regalo,
                   'search':list_busco,
                   'repair':list_reparar,
                   'parts':list_piezas,
                   'synt_brand':list_brand,
                   'description':texto_descriptivo_salida,
                   'user':list_user,
                   'city':list_city,
                   'published':list_published,
                   'expire':list_expire,
                   'date_scrapped':date_scrapped,
                   'seen':list_times_seen
                  },index = list(range(1,len(texto_descriptivo_salida)+1)))

print('DataFrame:\n', df)

hispa_csv_data = df.to_csv('hispa_dataframe.csv', index = True)

DataFrame:
     urgent buy change sell   price gift search repair parts  \
1        0   0      0    1     280    0      0      0     0   
2        0   0      0    1     345    0      0      0     0   
3        0   0      0    1      20    0      0      0     0   
4        0   0      0    1     140    0      0      0     0   
5        0   0      0    1     980    0      0      0     0   
6        0   0      0    1     350    0      0      0     0   
7        0   0      0    1      95    0      0      0     0   
8        0   0      0    1    1850    0      0      0     0   
9        0   0      0    1     700    0      0      0     0   
10       0   0      0    1     290    0      0      0     0   
11       0   0      0    1     699    0      0      0     0   
12       0   0      0    1     104    0      0      0     0   
13       0   0      0    1     120    0      0      0     0   
14       0   0      0    1      80    0      0      0     0   
15       0   0      0    1     160    0    
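
Because the DataFrame is built from parallel lists, every list needs the same number of entries; pandas raises a `ValueError` when the column lists differ in length. A small check like the following can point to the list that is out of step before the DataFrame is built. The helper name `length_mismatches` and the sample data are assumptions for illustration.

```python
def length_mismatches(columns):
    """Return the columns whose length differs from the most common length."""
    lengths = {name: len(values) for name, values in columns.items()}
    counts = {}
    for n in lengths.values():
        counts[n] = counts.get(n, 0) + 1
    expected = max(counts, key=counts.get)   # the most common length
    return {name: n for name, n in lengths.items() if n != expected}

# Made-up example: 'city' is one entry short
columns = {
    'price': [280, 345, 20],
    'user': ['a', 'b', 'c'],
    'city': ['Madrid', 'Madrid'],
}
print(length_mismatches(columns))
```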

In [10]:
df.to_csv('hispa_dataframe.csv', index = False)

In [11]:
df

|Unnamed: 0|urgent|buy|change|sell|price|gift|search|repair|parts|synt_brand|description|user|city|published|expire|date_scrapped|seen|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|1|0|0|0|1|280|0|0|0|0|elka|teclado elka omb5 caja de ritmos+acompañamient...|gpiccolini|Madrid|5|19/10/2022|25/8/2022|236|
|2|0|0|0|1|345|0|0|0|0|-|jen sx-2000 synthetone|Cobe|Madrid|04/07/2022|02/09/2022|25/8/2022|354|
|3|0|0|0|1|20|0|0|0|0|korg|diskette sonidos para el sampler korg dsm1|polypodium|Madrid|07/07/2022|21/09/2022|25/8/2022|178|
|4|0|0|0|1|140|0|0|0|0|roland|groovebox roland d2|manolo garcia|Madrid|el|05/09/2022|25/8/2022|810|
|5|0|0|0|1|980|0|0|0|0|moog|moog subsequent 37 sub37|Morphido|Madrid|19/07/2022|26/08/2022|25/8/2022|857|
|6|0|0|0|1|350|0|0|0|0|alesis|alesis micron|PowerPlant|Bizkaia|15/05/2022|18/09/2022|25/8/2022|343|
|7|0|0|0|1|95|0|0|0|0|volca|korg volca bass|Alejandro M. Molina Amador|Málaga|Molina|21/09/2022|25/8/2022|308|
|8|0|0|0|1|1850|0|0|0|0|-|leslie 122 clásico y autónomo|windbass|Zaragoza|10/07/2021|24/09/2022|25/8/2022|1021|
|9|0|0|0|1|700|0|0|0|0|access|access matrix programmer|cvsequencer|Barcelona|3|20/09/2022|25/8/2022|349|
|10|0|0|0|1|290|0|0|0|0|-|kit controlador para leslies de 6 pines 122 y ...|windbass|Zaragoza|10/07/2021|24/09/2022|25/8/2022|657|


In [16]:
df.columns.value_counts().sum()

17