# Hispasonic


<br>

Hispasonic is a website about musical instruments, recording gear, and everything related to the world of music. It also hosts a second-hand market where users sell their musical instruments.

This first part of the project focuses on obtaining relevant ad information. The category I have focused on is electronic musical instruments.

<br>

Before we start obtaining information, we must first understand how the ads page is organized.

***

- *Image of one of the pages of hispasonic*


![hispa_1e.png](hispa_1e.png)

<br>

<br>

We can see several important things:

- The selected category is "teclados y sintetizadores" (keyboards and synthesizers).

- The number of pages that we have to analyze to get **all the ads**.



## Function library loading

In [3]:
import requests               # HTTP library for Python
from bs4 import BeautifulSoup # pulls data out of HTML and XML files
import re                     # regular expression operations
import pandas as pd           # data analysis tool
import os                     # operating system-dependent functionality
import datetime as dt         # manipulating dates and times
import time                   # pauses between requests

pd.set_option("display.max_rows", None)

### first contact

First of all, we must check that we get a proper response from the server.

In [2]:
%%html 
<style>
table {float:left}
</style>

These are the main possible answers we can get from the server:

|||
|:--|:--|
|**1xx informational response –** |the request was received, continuing process|
|**2xx successful –** |the request was successfully received, understood, and accepted|
|**3xx redirection –** |further action needs to be taken in order to complete the request|
|**4xx client error –** |the request contains bad syntax or cannot be fulfilled|
|**5xx server error –** |the server failed to fulfil an apparently valid request|

In [3]:
# Enter the address and see the response from the server.

url = "https://www.hispasonic.com/anuncios/teclados-sintetizadores"
page = requests.get(url)
page

<Response [200]>

#### `<Response [200]>` means the request succeeded.
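The table of status classes above can be expressed in code. This is a minimal sketch; the helper name `status_class` is hypothetical and is not part of `requests`.

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to the name of its class (first digit)."""
    classes = {
        1: "informational response",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


print(status_class(200))  # successful
print(status_class(404))  # client error
```

With `requests`, the integer code is available as `page.status_code`, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses.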

## Number of pages to analyze

Once we have communication, we need to find out the **total number of pages** to scrape.

![cantidad_iteraciones.png](cantidad_iteraciones.png)

The item is identified as follows.

       'ul', class_='pagination'
       
<br>

This is an unordered list (`ul`) with the class `pagination`.


To determine the number of iterations, that is, the number of pages to extract information from, I must find this element inside the HTML content and read its maximum value.

We will do this with BeautifulSoup, which lets us extract the contents of an element.

In [4]:
soup = BeautifulSoup(page.content, 'html.parser')
# soup  # all site code

Inside the `soup` variable we look for `'ul', class_='pagination'`.

This element contains:

- the **links to the first 5 pages**,

- a link to the **next 10 pages** and a link to the **last one**, which is the one that interests us.

We save its children in a list called `unordered_list`.

In [5]:
unordered_list = soup.find('ul', class_='pagination')  # the pagination <ul> tag
unordered_list = unordered_list.contents               # a tag's children are available as a list in .contents
unordered_list                                         # list

['\n',
 <li>
 <span class="selected">1</span>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina2" rel="next">2</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina3">3</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina4">4</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina5">5</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina11" title="Siguientes 10 páginas">›</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina33" title="Última página">»</a>
 </li>,
 '\n']

### Exploring `unordered_list`

In [6]:
len(unordered_list) # number of elements

15

In [7]:
unordered_list[0] # first element

'\n'

In [8]:
unordered_list[-1] # last element

'\n'

In [9]:
unordered_list[-2] # this is the one I'm interested in

<li>
<a href="/anuncios/teclados-sintetizadores/pagina33" title="Última página">»</a>
</li>

### Get the value number from `unordered_list`

Since what I need is the value inside the list, the strategy is the following:

- Convert the list element to a text string

- Filter the characters that correspond to numeric values

- Convert those numeric characters to numbers

I cannot rely only on the value I want always being in the second-to-last position, so I will convert the content of the list to a text string and use regular expressions to pick out the highest number.

First, I convert the pagination element (`unordered_list[-2]`) to a text string.

In [10]:
test = str(unordered_list[-2])
test

'<li>\n<a href="/anuncios/teclados-sintetizadores/pagina33" title="Última página">»</a>\n</li>'

The `extractMax` function finds every run of digits in the string, converts each one to an integer, and returns the largest.

In [11]:
def extractMax(text):
    # \d+ is a regular expression that matches one or more digits,
    # so findall returns every run of digits as a string,
    # e.g. ['100', '564', '365']
    numbers = re.findall(r'\d+', text)
    # map int() over the list to convert each string to an integer
    numbers = map(int, numbers)
    return max(numbers)  # returns an integer

In [12]:
page_numbers = extractMax(test)
page_numbers

33

We already have the number of pages that we will have to analyze. 
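As an alternative, the page number can be read from the `paginaN` part of the links instead of from every digit in the element. The sketch below assumes the pagination markup shown earlier; the function name `last_page` is hypothetical.

```python
import re


def last_page(pagination_html: str) -> int:
    """Return the highest page number referenced by '/paginaN' links (1 if none)."""
    pages = [int(n) for n in re.findall(r"/pagina(\d+)", pagination_html)]
    return max(pages, default=1)


# Sample taken from the pagination element printed above
sample = '''
<li><a href="/anuncios/teclados-sintetizadores/pagina2" rel="next">2</a></li>
<li><a href="/anuncios/teclados-sintetizadores/pagina11" title="Siguientes 10 páginas">›</a></li>
<li><a href="/anuncios/teclados-sintetizadores/pagina33" title="Última página">»</a></li>
'''
print(last_page(sample))  # 33
```

Because only the numbers that follow `/pagina` are considered, unrelated digits such as the "10" in the link title are ignored, and the result does not depend on the position of the last-page link in the list.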

***

### Get and save all ad links.

Iterating over each of the pages, we will:

- Extract everything that is a link.


- Keep only the links that end in a number, which is how ads are told apart from the other links.

In [13]:
links_ads = []        # every link found on the listing pages
listado_enlaces = []  # only the links that point to an ad

pattern = r"([0-9]{4,9})"  # ad links contain a numeric id of 4 to 9 digits

for pagina in range(page_numbers, 0, -1):
    url = "https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina{pagina}".format(pagina=pagina)
    print(url)

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    for link in soup.find_all('a'):       # collect every link on the page
        links_ads.append(link.get('href'))

# Filter once, after all pages are collected, so that links from
# earlier pages are not appended again on every iteration
for s in links_ads:
    if s and re.search(pattern, s):       # skip <a> tags without href
        listado_enlaces.append(s)

https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina33
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina32
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina31
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina30
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina29
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina28
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina27
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina26
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina25
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina24
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina23
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina22
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina21
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina20
https://www.hispasonic.com/anuncio

In [14]:
links_ads[5:20] # example

['/musica',
 '/productos',
 '/anuncios',
 '/anuncios/todo',
 '/anuncios/todo/f/compra-protegida',
 '/anuncios',
 '/anuncios/compraventa',
 '/anuncios/teclados-sintetizadores',
 '/anuncios/todo/f/compra-protegida',
 '/compra-protegida',
 '/anuncios/todo/f/compra-protegida',
 '/index.php?controller=ad&action=new_ad_form',
 '/anuncios/teclados-sintetizadores',
 '/anuncios/teclados-sintetizadores/pagina23',
 '/anuncios/teclados-sintetizadores/pagina29']

In [15]:
listado_enlaces[5:20] # example

['/anuncios/arturia-matrixbrute/965788',
 '/anuncios/soporte-guil/900977',
 '/anuncios/soporte-guil/900977',
 '/anuncios/trajetas-teclados/949462',
 '/anuncios/trajetas-teclados/949462',
 '/anuncios/transient-8s-secuenciador-eurorack-8-pasos/941431',
 '/anuncios/transient-8s-secuenciador-eurorack-8-pasos/941431',
 '/anuncios/piano-pearl-river-pd180d/932174',
 '/anuncios/piano-pearl-river-pd180d/932174',
 '/anuncios/tascam-246-pinch-roller/876919',
 '/anuncios/tascam-246-pinch-roller/876919',
 '/anuncios/soniccell-roland/867426',
 '/anuncios/soniccell-roland/867426',
 '/anuncios/korg-vocoder-vc10/866556',
 '/anuncios/korg-vocoder-vc10/866556']
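The filtering step can be tried in isolation on a handful of links. This sketch assumes that ad URLs end in `/` followed by a 4 to 9 digit id, as in the output above; `filter_ad_links` is a hypothetical helper name.

```python
import re

# Assumption: an ad link ends in a slash followed by a 4-9 digit id
AD_LINK = re.compile(r"/\d{4,9}$")


def filter_ad_links(hrefs):
    """Keep only the hrefs that look like ad links; None values are skipped."""
    return [h for h in hrefs if h and AD_LINK.search(h)]


hrefs = [
    '/musica',
    None,  # an <a> tag without href
    '/anuncios/teclados-sintetizadores/pagina23',
    '/anuncios/korg-vocoder-vc10/866556',
    '/index.php?controller=ad&action=new_ad_form',
]
print(filter_ad_links(hrefs))  # ['/anuncios/korg-vocoder-vc10/866556']
```

Anchoring the pattern at the end of the string with `$` matches the stated intent ("ends in a number") more strictly than searching for digits anywhere in the link.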

### Cleaning links

Looking at the two lists, it is striking that some links are repeated:

            ...
            '/anuncios/korg-vocoder-vc10/866556',
            '/anuncios/korg-vocoder-vc10/866556',
            '/anuncios/polyend-tracker/1057403',
            '/anuncios/polyend-tracker/1057403',
            '/anuncios/trajetas-teclados/949462',
            '/anuncios/trajetas-teclados/949462',
            ...


To keep only one copy of each link, we filter them through a dictionary.

The link is used as the key, and an identifying name extracted from the URL is used as the value.

In [16]:
diccionario_enlaces = {}  # ad link -> identifying name taken from the URL

# First run of lowercase letters after "anuncios/", optionally preceded by one digit (1-9)
patron_marca = r"((?<=anuncios\/)[1-9][a-z]{1,})|((?<=anuncios\/)[a-z]{1,})"

for enlace in listado_enlaces:
    if enlace not in diccionario_enlaces:
        match = re.search(patron_marca, enlace)
        if match:  # links without a recognizable name are skipped
            diccionario_enlaces[enlace] = match.group()

![regex_expression.png](regex_expression.png)

In [17]:
diccionario_enlaces['/anuncios/polyend-tracker/1057403'] # view example

'polyend'
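To see what the dictionary step does to repeated links, here is a self-contained sketch. It uses a condensed variant of the pattern (`[1-9]?[a-z]+` after `anuncios/`), and the helper name `index_links` is hypothetical. The `0-coast` URL is an invented example, not a real ad.

```python
import re

# Condensed variant of the pattern above: an optional digit 1-9, then lowercase letters
NAME = re.compile(r"(?<=anuncios/)[1-9]?[a-z]+")


def index_links(links):
    """Map each distinct link to the first name found after 'anuncios/'."""
    index = {}
    for link in links:
        if link in index:      # repeated link: already stored
            continue
        match = NAME.search(link)
        if match:              # links without a matching name are skipped
            index[link] = match.group()
    return index


links = [
    '/anuncios/korg-vocoder-vc10/866556',
    '/anuncios/korg-vocoder-vc10/866556',
    '/anuncios/polyend-tracker/1057403',
    '/anuncios/0-coast/123456',  # hypothetical link: starts with 0, so no match
]
print(index_links(links))
```

Names that start with `0` or with another character outside the pattern are dropped, so such ads would not be downloaded later.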

With the dictionary we have just created, we are going to download all the ads locally.

Working from local copies means each ad only has to be requested once, which avoids overloading the server and the risk of being banned.

In [18]:
import time

main_path = 'https://www.hispasonic.com'
local_path = '/home/ion/Documentos/proyectos_personales/syntdata/htmls/'

for enlace in diccionario_enlaces:
    time.sleep(1)  # wait 1 second between requests
    page = requests.get(main_path + enlace)  # e.g. https://www.hispasonic.com/anuncios/polyend-tracker/1057403

    nombre = enlace.split("/")[2]  # ad name taken from the URL, e.g. 'polyend-tracker'

    # Note: two ads that share the same name are written to the same file
    with open(local_path + nombre + '.html', "w") as f:
        f.write(page.text)
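A possible variation of the download step is sketched below. It is not the code used in this project: `save_ads` and its `fetch` parameter are hypothetical names. The file name includes the ad id so that ads sharing a name do not collide, and files that already exist are not requested again.

```python
import time
from pathlib import Path


def save_ads(links, fetch, out_dir, delay=1.0):
    """Save each ad once as '<name>-<id>.html' and return the paths written.

    fetch is any callable that takes a link and returns the page HTML.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for link in links:
        parts = link.strip("/").split("/")       # ['anuncios', '<name>', '<id>']
        target = out / f"{parts[-2]}-{parts[-1]}.html"
        if target.exists():                      # already downloaded: no request
            continue
        target.write_text(fetch(link), encoding="utf-8")
        written.append(target)
        time.sleep(delay)                        # pause only after a real request
    return written
```

With `requests`, the callable could be `lambda link: requests.get(main_path + link).text`. Passing the fetch function as a parameter also makes it possible to try the logic without touching the server.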

In [19]:
sintes = ['access', 'acidlab', 'akai', 'alembic', 'alesis', 'allen&heath', 'analogue solutions', 'analogue systems',
          'arp', 'arturia', 'asm (ashun sound machines)', 'audio damage', 'audiophile circuits league', 'a-v-p synth',
         'balaguer', 'baloran', 'bastl instruments', 'behringer', 'black corporation', 'böhm', 'boss', 'bubblesound instruments',
         'buchla', 'casio', 'charlie lab', 'charvel', 'circuit abbey', 'clavia nord', 'club of the knobs', 'crumar',
         'custom made synths', 'cyclone', 'cyclone analogic', 'dave jones design', 'dave smith instruments', 'denon dj',
         'dexibell', 'doepfer musikelektronik', 'dreadbox', 'dubreq', 'elby designs', 'electronic music laboratories (eml)', 'elektron',
         'elka', 'elta music', 'e:m:c', 'e-mu systems', 'endorphin.es', 'endorphines', 'ensoniq', 'epiphone', 'erica synths',
         'ernie ball music man', 'esp ltd', 'evh', 'exodus digital', 'farfisa', 'fender', 'fishman', 'fodera', 'formanta',
         'fretlight', 'friedman', 'future retro', 'futuresonus', 'gemini', 'generalmusic', 'gibson', 'godin', 'gotharman',
         'graph tech', 'gretsch', 'guild', 'hammond', 'hartmann', 'hexinverter', 'hinton instruments', 'höfner', 'hypersynth',
         'ibanez', 'ik multimedia', 'intellijel', 'isla instruments', 'jackson', 'john bowen synth design', 'jomox', 'kawai',
         'kenton', 'ketron', 'kilpatrick audio', 'koma elektronik', 'korg', 'kramer', 'kurzweil', 'lakland', 'line 6',
         'linn electronics', 'livid', 'logan electronics', 'macbeth studio systems', 'make noise', 'malekko', 'manikin electronic',
         'm-audio', 'mellotron', 'mfb', 'modal electronics', 'modor', 'modulus', 'moog', 'mutable instruments', 'native instruments',
         'novation', 'numark', 'oberheim', 'orthogonal devices', 'peavey', 'pioneer dj', 'pittsburgh modular', 'polyend', 'ppg (palm products gmbh)',
         'prs', 'quasimidi', 'qu-bit electronix', 'waves grendel', 'radikal technologies', 'rhodes', 'rickenbacker', 'roland',
         'roli', 'schecter', 'sequential', 'sequential circuits', 'sequentix', 'simmons','sintetizador', 'sonicware', 'special waves',
         'spector', 'spectral audio', 'squarp instruments', 'squier', 'ssf', 'stanton', 'steinberger', 'sterling',
         'studio electronics', 'studiologic music', 'synths', 'synthstrom', 'taylor', 'technos', 'teenage engineering', 'tiptop audio',
         'traveler guitar', 'udo audio', 'vermona', 'viscount', 'vox', 'waldorf', 'warwick', 'washburn', 'wersi music', 'winter modular',
         'wurlitzer', 'wmd', 'wmd / ssf', 'yamaha', 'zeppelin design labs', 'synthesis technology', 'nord', 'tascam', 'synthrotek',
         'e-mu', 'transient modules', 'befaco', 'tiptop', 'dynacord', 'dave smith', 'polygraf', 'studiologic', 'corsynth',
         'micro modular', 'mutant', 'chronograf', 'maschine', 'pearl', 'yocto', 'komplete', 'nord electro', 'erica synth',
         'frequency central', 'nord lead 4', 'make noise', 'modal', 'nord stage', 'acces', 'doepfer', 'eventide', 'instruo',
         'atomo synth', 'atomosynth', 'wersi', 'nord lead 3', 'vermona', 'bheringer', 'behringer', 'synamodec', 'teenage enginering',
         'sputnik', 'strymon','synthstrom', 'sequencial', 'sequential', 'sequential circuits', 'kurzweil', 'electribe', 'frap tool', 'synthtech',
         'qubit', 'qu-bit', 'qu bit', 'pittsburgh', 'cre8audio', 'paratek', 'uno synth ', 'bastl', 'volca', 'deepmind', 'deepmind 6',
         'deepmind 12', 'monome', 'neutron', 'asm', 'evolver', 'mpc', 'eurorack', 'controlador', 'teclado', 'octatrack', 'digitack',
         'models', 'bitbox', 'axoloty', 'noise engineering', 'virus ti', '0-coast', '0 coast', 'coast']

In [20]:
os.chdir('/home/ion/Documentos/proyectos_personales/syntdata/htmls/') # working directory mapping

In [21]:
os.getcwd() # checking path

'/home/ion/Documentos/proyectos_personales/syntdata/htmls'

### Extract parameters from the saved HTML files with regex

We want to extract the price, the username, and the other parameters that interest us.

To do so, we select the relevant **html** element and filter its content with regex (regular expressions).

In [22]:
def get_data(pay_load, pattern_regex):
    """Return the first match of pattern_regex in the lower-cased payload, or None."""
    pay_load = str(pay_load).lower()
    match = re.search(pattern_regex, pay_load)
    return match.group() if match else None
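The behaviour of this kind of helper is easier to see on a small input. In the sketch below, `first_match` is a hypothetical stand-in with the same logic, and the HTML snippets are invented for illustration; the real markup of the ad pages may differ.

```python
import re


def first_match(payload, pattern):
    """Lower-case the payload and return the first regex match, or None."""
    match = re.search(pattern, str(payload).lower())
    return match.group() if match else None


price_pattern = r"[1-9]\w+"  # a digit 1-9 followed by at least one word character

print(first_match('<div class="ad-price">390 €</div>', price_pattern))  # 390
print(first_match('<div class="ad-price">5 €</div>', price_pattern))    # None
print(first_match(None, price_pattern))                                 # None
```

The second call shows a limit of the price pattern: it needs at least two characters, so a single-digit price is not captured. The third call shows that a missing element (`None`) simply yields `None`.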

In [23]:
list_price = []
list_brand = []
list_model = []
list_user = []
list_date = []
list_expire = []
list_times_seen= [] 
list_city = []

What we are going to do is read each of the ads that we saved locally.
<br>

![hispa_4.png](hispa_4.png)

<br>

This is an ad where you can identify the fields we want to get:

- **price**

- **user**

- **city**

- **times seen**

- **date expire**

- **description**


**Note:**

The title of the ad will be filtered to extract the manufacturer's **brand** and the relevant **model** or **description**.


The publication date is a useful field, but its format is not stable. Because of the forum's internal mechanism, when the user modifies an ad the date changes from an absolute date to a relative one.

Example:

<br>

![hispa_56.png](hispa_56.png)

<br>


For example, in this ad the publication date disappears and is replaced by the time elapsed since the last modification, expressed in seconds, minutes, hours, days or weeks.


Therefore, the most reliable way to get a consistent publication date is to derive it from the **expiration** date, since an ad initially expires two months after it is published.
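A minimal sketch of that derivation follows. It assumes the expiry is exactly two calendar months after publication and that the ad has not been renewed; `estimate_published` is a hypothetical helper name.

```python
import calendar
from datetime import date, datetime


def estimate_published(expires: str, months: int = 2) -> date:
    """Subtract whole calendar months from a dd/mm/YYYY expiry date."""
    exp = datetime.strptime(expires, "%d/%m/%Y").date()
    total = exp.year * 12 + (exp.month - 1) - months   # months since year 0
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day for shorter months, e.g. 30 April -> 28 February
    day = min(exp.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


print(estimate_published("23/09/2022"))  # 2022-07-23
```

If an ad has been renewed, the value obtained this way is the date of the last renewal rather than the original publication date.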



In [24]:
for i in os.listdir('.'):
    if not i.endswith('.html'):  # skip anything that is not a saved ad
        continue

    with open(i, "r") as f:
        pagina = f.read()
    soup = BeautifulSoup(pagina, "html.parser")

    # --- price
    price = soup.find('div', class_='ad-price')
    pattern_price = r"([1-9]\w+)"
    price = get_data(price, pattern_price)
    list_price.append(price)

    # --- title
    title = soup.find('h1', class_='title')
    pattern_title = r'(?<=">).+(?=<\/)'
    title = get_data(title, pattern_title)

    # --- brand: first word of the title
    pattern_brand = r"^\w+(?=\s)"
    brand = get_data(title, pattern_brand)
    list_brand.append(brand)

    # --- model: what follows the first word, up to the last whitespace
    pattern_model = r'(?<=\s)\w.+(?=\s)'
    model = get_data(title, pattern_model)
    list_model.append(model)

    # --- user
    user = soup.find('a', class_='user-link')
    pattern_user = r'\w+(?=<)'
    user = get_data(user, pattern_user)
    list_user.append(user)

    # --- dates
    data_user = soup.find('div', class_='col-lg-7')
    pattern_date = r'[0-9]{2}\/[0-9]{2}\/[0-9]{4}(?=\s\w)'  # absolute format dd/mm/YYYY
    pattern_semana = r'\w+\s\d+\s\w+(?=\s\w{1})'            # relative format, e.g. "hace 1 semana"

    date = get_data(data_user, pattern_date)
    dates = get_data(data_user, pattern_semana)

    if date is not None:
        list_date.append(date)

    if dates is not None:
        list_date.append(dates)

    # --- expiry date
    pattern_expira = r'[0-9]{2}\/[0-9]{2}\/[0-9]{4}(?=\s\|)'  # format dd/mm/YYYY
    expire = get_data(data_user, pattern_expira)
    list_expire.append(expire)

    # --- times seen
    pattern_visto = r'(?<=\s)\d\w+(?=\s)'
    times_seen = get_data(data_user, pattern_visto)
    list_times_seen.append(times_seen)

    # --- city
    pattern_lugar = r'(?<=\w>)\w+(?=<)'
    city = get_data(data_user, pattern_lugar)
    list_city.append(city)
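The relative dates captured by `pattern_semana` could also be turned into absolute timestamps. The sketch below assumes the site writes them as "hace N unit" in Spanish, with the units mentioned earlier (seconds, minutes, hours, days, weeks); the exact wording on the site is an assumption, and `relative_to_absolute` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta

# Assumed Spanish unit names (singular) mapped to timedelta keyword arguments
UNITS = {
    "segundo": "seconds",
    "minuto": "minutes",
    "hora": "hours",
    "día": "days",
    "dia": "days",
    "semana": "weeks",
}


def relative_to_absolute(text, now):
    """Convert 'hace 1 semana' style text to a datetime, or return None."""
    match = re.search(r"hace\s+(\d+)\s+(\w+)", text.lower())
    if match is None:
        return None
    amount = int(match.group(1))
    unit = match.group(2).rstrip("s")   # 'semanas' -> 'semana'
    if unit not in UNITS:
        return None
    return now - timedelta(**{UNITS[unit]: amount})


scraped_at = datetime(2022, 8, 2, 12, 0)
print(relative_to_absolute("hace 1 semana", scraped_at))  # 2022-07-26 12:00:00
```

The result is the time of the last modification, not the original publication date, so it complements the expiry-based estimate rather than replacing it.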

### Data extraction

The next step is to record the date on which the data extraction is run. This is important for the analysis, because it lets us keep a record of the ads over time.

In [25]:
hoy = dt.datetime.now()

# day/month/year without zero padding, e.g. '2/8/2022'
date_scrapped = f"{hoy.day}/{hoy.month}/{hoy.year}"

date_scrapped

'2/8/2022'
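The scrape date above is not zero-padded, while the expiry dates extracted from the ads are (`dd/mm/YYYY`). If both columns should share one layout, `strftime` can produce it. This is a small sketch; `scrape_stamp` is a hypothetical helper name.

```python
from datetime import date


def scrape_stamp(day: date) -> str:
    """Format a date as zero-padded dd/mm/YYYY, like the expiry dates."""
    return day.strftime("%d/%m/%Y")


print(scrape_stamp(date(2022, 8, 2)))  # 02/08/2022
```

A single layout lets both columns be parsed later with the same format string.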

In [26]:
df = pd.DataFrame({'brand':list_brand,
                   'model':list_model,
                   'price':list_price,
                   'times_seen':list_times_seen,
                   'city':list_city,
                   'scrap_date':date_scrapped,
                   'expires':list_expire,
                   'user':list_user,
                  },index = list(range(1,len(list_expire)+1)))

df

| |brand|model|price|times_seen|city|scrap_date|expires|user|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|1|modal|electronics cobalt|390.0|136|castellón|2/8/2022|23/09/2022||
|2|moog|little phatty stage|900.0|434|baleares|2/8/2022|28/09/2022|gergö|
|3||precio)doepfer a-100|420.0|214|málaga|2/8/2022|30/08/2022|manuelariza|
|4||e-mu planet||114|madrid|2/8/2022|25/09/2022|creativestudio|
|5|novation|sl49|330.0|224|valencia|2/8/2022|20/08/2022|vidiade|
|6|synthrotek||105.0|426|girona|2/8/2022|20/09/2022|moog1406|
|7|roland|fantom|1300.0|328|cuenca|2/8/2022|24/08/2022|neluq|
|8|alesis|qx49 - controlador|80.0|132|córdoba|2/8/2022|17/09/2022|sintexia|
|9|roli|seaboard|220.0|402|madrid|2/8/2022|06/08/2022|rvperez|
|10|access|virus|405.0|813|salamanca|2/8/2022|31/08/2022|mike|


In [27]:
df['expires'].isnull().sum() # number of rows without an expiry date

0

In [28]:
df[df['expires'].isnull()] 

| |brand|model|price|times_seen|city|scrap_date|expires|user|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|


In [29]:
df.dropna(inplace=True) # drops every row that has a missing value in any column

In [30]:
df.to_csv('df_hispasonic.csv', index=False)
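If one file per extraction is wanted, the scrape date can go into the file name. A date written with slashes, such as `'2/8/2022'`, would be read as directory separators in a path, so this sketch uses the ISO format instead. `csv_name` is a hypothetical helper name.

```python
from datetime import date


def csv_name(day: date, prefix: str = "df_hispasonic") -> str:
    """Build a file name such as 'df_hispasonic_2022-08-02.csv'."""
    return f"{prefix}_{day.isoformat()}.csv"


print(csv_name(date(2022, 8, 2)))  # df_hispasonic_2022-08-02.csv
```

ISO dates also sort chronologically when the files are listed by name.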

