# Hispasonic (from web to csv)


<br>

As a fan of electronic musical instruments, I have always enjoyed browsing this kind of site. Digging into data patterns is something that captivates me just as much, and this project brings both passions together.

Hispasonic holds significant importance in Spain as a hub for musical instruments, recording equipment, and everything within the realm of music. The platform includes a second-hand market where users can sell, purchase, exchange, or even give away their musical instruments.

This initial phase of the project primarily involves gathering relevant advertisements, with a specific focus on the category of electronic musical instruments.

<br>

Before gathering any information, we first need to understand how the ad listing page is organized.

***

- *Image of one of the pages of hispasonic*


![hispa_1e.png](images/hispa_1e.png)

<br>

<br>

We can see two important things:

- The selected category is "teclados y sintetizadores" (keyboards and synthesizers).

- The pagination tells us how many pages we have to go through to get **all the ads**.


## 1. Loading the libraries.

In [1]:
import requests               # Is an elegant and simple HTTP library for Python
from bs4 import BeautifulSoup # library for pulling data out of HTML and XML files
import re                     # regular expressions operations
import pandas as pd           # A fast, powerful, flexible and easy to use open source data analysis tool
import os                     # A versatile way to use operating system-dependent functionality.
import datetime as dt         # module for manipulating dates and times.
import time                   # This module provides various time-related functions.
import random                 # This module implements pseudo-random number generators for various distributions.

### First contact

First of all, we need to know whether the server gives us a proper response.

In [2]:
%%html 
<style>
table {float:left}
</style>

These are the main possible answers we can get from the server:

|||
|:--|:--|
|**1xx informational response –** |the request was received, continuing process|
|**2xx successful –** |the request was successfully received, understood, and accepted|
|**3xx redirection –** |further action needs to be taken in order to complete the request|
|**4xx client error –** |the request contains bad syntax or cannot be fulfilled|
|**5xx server error –** |the server failed to fulfil an apparently valid request|

In [3]:
# Enter the address and see the response from the server.

url = "https://www.hispasonic.com/anuncios/teclados-sintetizadores"
page = requests.get(url)
page

<Response [200]>

#### *<Response [200]> means correct connection.*
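
As a minimal sketch of how the table above could be applied in code: `status_class` is a hypothetical helper (it is not part of `requests`) that maps a status code to its response class. With a real response you would pass it `page.status_code`.

```python
def status_class(code):
    """Return the response class for an HTTP status code (e.g. 200 -> 'successful')."""
    # Hypothetical helper: the first digit of the code selects the class.
    classes = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


print(status_class(200))  # successful
print(status_class(404))  # client error
```

Only the 2xx class means the page content is safe to parse; any other class is a signal to stop before scraping.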

## 2. Main strategy, get the number of pages to analyze.

Once we can communicate with the server, we need to work out how to obtain the **total number of pages** to scrape.

Each page holds the ads we want to analyze, so it is important to know how to obtain that value, since it varies with the number of ads on offer.

![cantidad_iteraciones.png](images/cantidad_iteraciones.png)

The item is identified as follows.

       'ul', class_='pagination'
       
<br>

**ul** means *'unordered list'* with a **class** name called `pagination`.

<br>

To determine the number of iterations, that is, the number of pages on which to extract the information, I must:

- Find this element inside the html content.

- Know the final value.

<br>

We will do this with [**Beautiful Soup**](https://beautiful-soup-4.readthedocs.io/en/latest/#), which lets us extract the contents of an element.

In [4]:
soup = BeautifulSoup(page.content, 'html.parser')
# soup <- all site is stored in this variable

<br>

So the `soup` variable now holds all the code of the page, and we are looking for `'ul', class_='pagination'`.

The element contains:

- the **links to the first 5 pages**

- the link to the **next 10 pages**, which is the **last one** in the list and the one that interests us.

We save it in a variable called `unordered_list`.

In [5]:
unordered_list = soup.find('ul', class_='pagination') # into variable unordered_list
unordered_list

<ul class="pagination">
<li>
<span class="selected">1</span>
</li>
<li>
<a href="/anuncios/teclados-sintetizadores/pagina2" rel="next">2</a>
</li>
<li>
<a href="/anuncios/teclados-sintetizadores/pagina3">3</a>
</li>
<li>
<a href="/anuncios/teclados-sintetizadores/pagina4">4</a>
</li>
<li>
<a href="/anuncios/teclados-sintetizadores/pagina5">5</a>
</li>
<li>
<a href="/anuncios/teclados-sintetizadores/pagina11" title="Siguientes 10 páginas">›</a>
</li>
</ul>

[contents and children (Beautiful Soup)](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=contents#contents-and-children)

In [6]:
unordered_list = unordered_list.contents # a tag's children are available as a list through .contents
unordered_list

['\n',
 <li>
 <span class="selected">1</span>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina2" rel="next">2</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina3">3</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina4">4</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina5">5</a>
 </li>,
 '\n',
 <li>
 <a href="/anuncios/teclados-sintetizadores/pagina11" title="Siguientes 10 páginas">›</a>
 </li>,
 '\n']

### 2.1 Exploring `unordered_list` variable.

`unordered_list` is now a list, so we can check its length and the position of each of its elements.

In [7]:
len(unordered_list) # number of elements

13

In [8]:
unordered_list[0] # first element

'\n'

In [9]:
unordered_list[-1] # last element

'\n'

In [10]:
unordered_list[-2] # this is the one I'm interested in

<li>
<a href="/anuncios/teclados-sintetizadores/pagina11" title="Siguientes 10 páginas">›</a>
</li>

### 2.2 How to get the value number from `unordered_list`?

<br>

I need to access the value within that element, so the strategy will be the following:

- 1. Convert the element to a text string.

- 2. Extract the numeric values and keep only the highest one.

- 3. Convert those numeric characters to a number (int).

<br>

I will convert the element into a text string and use regular expressions to extract the numbers and keep the highest value.

**1. Converting the content of `unordered_list[-2]` into a text string**.

In [11]:
test = str(unordered_list[-2])
test

'<li>\n<a href="/anuncios/teclados-sintetizadores/pagina11" title="Siguientes 10 páginas">›</a>\n</li>'

**2/3. Extract the numeric values, keep only the highest one, and convert it to a number (int)**.

`extractMax` is a function that finds every number contained in the text, converts them to integers, and returns the highest one.

In [12]:
def extractMax(text):
    # \d+ is a regular expression that means "one or more digits",
    # so findall returns every run of digits in the text,
    # e.g. ['100', '564', '365']
    numbers = re.findall(r'\d+', text)
    # map int() onto every element to convert the strings into integers
    numbers = map(int, numbers)
    return max(numbers)  # returns an int

In [13]:
page_numbers = extractMax(test)
page_numbers

11

We already have the number of pages that we will have to analyze. 
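
Note that `extractMax` looks at every number in the string, including the `10` in the link title. As a hedged alternative, the sketch below reads the page number only from `paginaN` links. `last_page_from_html` is a hypothetical helper name, and the fallback value of 1 is an assumption for pages without pagination.

```python
import re


def last_page_from_html(html):
    """Return the highest page number found in 'paginaN' links, or 1 if there are none."""
    # Only digits that directly follow '/pagina' are considered page numbers.
    pages = [int(n) for n in re.findall(r"/pagina(\d+)", html)]
    return max(pages, default=1)


sample = ('<li>\n<a href="/anuncios/teclados-sintetizadores/pagina11" '
          'title="Siguientes 10 páginas">›</a>\n</li>')
print(last_page_from_html(sample))  # 11
```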

***

## 3. Getting and saving all links (ads and non-ads)

Once we know how many pages hold ads, the next step is to extract those ads from each page by looking inside its code.

So what we have to do is:

- Extract **everything that is a link**.


- From the extracted links, keep those that end in a number, which is how we **tell the ads apart from everything else**.
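
The filtering step can be sketched on a few sample links taken from this notebook. `is_ad` and `AD_ID` are hypothetical names, and the sketch assumes that ad links always end in a numeric id of 4 to 9 digits.

```python
import re

# Assumption: an ad link ends in "/<numeric id>" with 4 to 9 digits.
AD_ID = re.compile(r"/(\d{4,9})$")


def is_ad(href):
    """True when the link looks like an ad; <a> tags without href (None) are rejected."""
    return href is not None and AD_ID.search(href) is not None


links = [
    "/anuncios/todo",
    "/anuncios/teclados-sintetizadores/pagina9",
    "/anuncios/roland-jd-xa/1123891",
    None,  # an <a> tag without an href attribute
]
ads = [href for href in links if is_ad(href)]
print(ads)  # ['/anuncios/roland-jd-xa/1123891']
```

Anchoring the pattern at the end of the link keeps pagination links such as `pagina9` out of the result.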

In [14]:
links_ads = []        # every link found on the pages (ads and non-ads)
listado_enlaces = []  # only the links that point to an ad

pattern="([0-9]{4,9})" # ad links contain a numeric id of 4 to 9 digits

for pagina in range(page_numbers, 0, -1): 
    url = "https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina{pagina}".format(pagina=pagina)
    print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    for link in soup.find_all('a'):       # every <a> tag on this page
        href = link.get('href')
        links_ads.append(href)
        # keep only the links of this page that contain an ad id;
        # href is None when the <a> tag has no href attribute
        if href and re.search(pattern, href):
            listado_enlaces.append(href)

https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina11
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina10
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina9
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina8
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina7
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina6
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina5
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina4
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina3
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina2
https://www.hispasonic.com/anuncios/teclados-sintetizadores/pagina1


#### This is a small sample of the contents of the lists

Both lists hold links, but this slice of the first one, `links_ads`, only shows **links that do not interest us**.

In [15]:
links_ads[5:20] # example: items 5 to 19 of everything that is a link in the soup variable

['/musica',
 '/productos',
 '/anuncios',
 '/anuncios/todo',
 '/anuncios/todo/f/compra-protegida',
 '/anuncios',
 '/anuncios/compraventa',
 '/anuncios/teclados-sintetizadores',
 '/anuncios/todo/f/compra-protegida',
 '/compra-protegida',
 '/anuncios/todo/f/compra-protegida',
 '/index.php?controller=ad&action=new_ad_form',
 '/anuncios/teclados-sintetizadores',
 '/anuncios/teclados-sintetizadores',
 '/anuncios/teclados-sintetizadores/pagina9']

<br>

In the second list, `listado_enlaces`, however, we have the **links we want to get from each page**.

In [16]:
listado_enlaces[5:20] # example: the links that end in a number, i.e. the ads

['/anuncios/slider-cap-arp-odyssey-mkiii/1124687',
 '/anuncios/compro-linndrum/1131733',
 '/anuncios/compro-linndrum/1131733',
 '/anuncios/roland-jd-xa/1123891',
 '/anuncios/roland-jd-xa/1123891',
 '/anuncios/roland-xv-3080-128-voice-rackmount-synthesizer-module/1132282',
 '/anuncios/roland-xv-3080-128-voice-rackmount-synthesizer-module/1132282',
 '/anuncios/roland-fantom-s/1118566',
 '/anuncios/roland-fantom-s/1118566',
 '/anuncios/korg-x3-modulo/1118568',
 '/anuncios/korg-x3-modulo/1118568',
 '/anuncios/taclado-casio-celviano-ap80r/1122261',
 '/anuncios/taclado-casio-celviano-ap80r/1122261',
 '/anuncios/behringer-cat/1118903',
 '/anuncios/behringer-cat/1118903']

## 3.1 Cleaning links.

Looking at `listado_enlaces`, it is striking that some links are repeated.

            '...
            '/anuncios/korg-vocoder-vc10/866556',
             '/anuncios/korg-vocoder-vc10/866556',
             '/anuncios/polyend-tracker/1057403',
             '/anuncios/polyend-tracker/1057403',
             '/anuncios/trajetas-teclados/949462',
             '/anuncios/trajetas-teclados/949462',
                                             ...',

<br>

### We need to do a couple of things.

![regex_expression.png](images/regex_expression.png)

- 1. **Extract** the brand name from the URL using regular expressions.

- 2. **Filter** out the repeated URLs.



<br>

<br>

To get rid of the **repeated URLs**, we will **filter them with a dictionary**.

The main idea is to store each unique URL as a `key` and assign it the synth brand taken from that URL as its `value`.

In [29]:
os.chdir('/home/ion/Documentos/albertjimrod/hispaok/htmls') # folder where htmls folder is .

In [28]:
diccionario_enlaces = {} # dict
listado_marcas = []      # synth_brand

brand_pattern = r"((?<=anuncios\/)[1-9][a-z]{1,})|((?<=anuncios\/)[a-z]{1,})" # filter brand regex

for enlace in listado_enlaces:
    if enlace not in diccionario_enlaces:  
        try:
            marca = re.search(brand_pattern, enlace).group()
            diccionario_enlaces[enlace] = marca
        except AttributeError:
            # re.search returned None (no brand found in the link), so skip it
            pass
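
To see what the pattern captures, here is a small sketch that runs the same `brand_pattern` on a few links from the samples above. The `sample_links` list and the `unique` dictionary are illustrative names only.

```python
import re

brand_pattern = r"((?<=anuncios\/)[1-9][a-z]{1,})|((?<=anuncios\/)[a-z]{1,})"

sample_links = [
    "/anuncios/korg-vocoder-vc10/866556",
    "/anuncios/korg-vocoder-vc10/866556",              # repeated link
    "/anuncios/polyend-tracker/1057403",
    "/anuncios/slider-cap-arp-odyssey-mkiii/1124687",
]

unique = {}
for link in sample_links:
    match = re.search(brand_pattern, link)
    if match and link not in unique:   # the dictionary keys drop the repeats
        unique[link] = match.group()

print(list(unique.values()))  # ['korg', 'polyend', 'slider']
```

The last link shows the limit of this first pass: the pattern takes the first word after `anuncios/`, which is not always a brand (`slider` instead of `arp`).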

With the dictionary we have just created, we are going to download all the ads locally.

The reason is to avoid overloading the server and running the risk of being banned.

## 3.2 Download all the ads.


To avoid overloading the server, we will download all the ads to a local folder, adding a random delay between downloads. Working from local copies is also more comfortable.

In [18]:
%%time

main_path='https://www.hispasonic.com'
local_path = '/home/ion/Documentos/albertjimrod/hispaok/htmls/'

for enlace in diccionario_enlaces:
    # sampling interval
    time.sleep(random.uniform(1, 4))       
    page = requests.get(main_path + enlace) # https://www.hispasonic.com/anuncios/polyend-tracker/1057403.html
    # filter for extracting
    enlace = enlace.split("/")  
    # name ad
    enlace= enlace[2]         
    # Open a file in write mode ("w+"). If the file doesn't exist, it's created. If it already exists, it is overwritten
    with open(local_path + enlace + '.html',"w+") as f: 
        #Write to the file that was opened earlier (f). page.text
        f.write(page.text)
    
    print(local_path + enlace)

CPU times: user 3.64 s, sys: 212 ms, total: 3.85 s
Wall time: 17min 28s
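
The loop above names each file after the ad slug only, so two ads that share a slug but have different ids would overwrite each other. A minimal sketch of one way around that, assuming links always have the form `/anuncios/<slug>/<id>`; `local_filename` is a hypothetical helper and the default folder name is only an example.

```python
from pathlib import Path


def local_filename(link, folder="htmls"):
    """Build a local path that includes both the slug and the ad id."""
    # '/anuncios/polyend-tracker/1057403' -> ['anuncios', 'polyend-tracker', '1057403']
    parts = link.strip("/").split("/")
    slug, ad_id = parts[1], parts[2]
    return Path(folder) / f"{slug}_{ad_id}.html"


print(local_filename("/anuncios/polyend-tracker/1057403").name)
# polyend-tracker_1057403.html
```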


## 3.3 It's not all about sales.

<br>

Once the ads have been downloaded, the next step is a quick scan of the downloaded files, because not every ad is a sale.


A starting point is to look at the ad titles and check whether certain words appear in them.


By using `find` and `grep` together we can see whether the words we are looking for are inside the files.

<br>



![vendo.png](images/vendo.png)

- **vendo : *sell***

<br>


![busco_piezas.png](images/busco_piezas.png)

- **busco, se busca: *looking for***

- **piezas: *parts***

<br>


![cambio.png](images/cambio.png)

- **cambio: *exchange***

<br>
    
![compro.png](images/compro.png)

- **compro: *buy***

<br>

![regalo.png](images/regalo.png)

- **regalo: *for free***

<br>

This information is very useful because these words are **the actions**: they **will allow us to classify** each ad as a sale, a purchase, or any of the other concepts we have discovered.
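
The same scan can be sketched in Python instead of `find` and `grep`. This is only an illustration: `count_files_with_keyword` and `KEYWORDS` are hypothetical names, and the temporary folder with two tiny files stands in for the real download folder.

```python
import tempfile
from pathlib import Path

KEYWORDS = ["vendo", "compro", "cambio", "regalo", "busco", "piezas"]


def count_files_with_keyword(folder, keywords=KEYWORDS):
    """Count how many .html files in `folder` contain each keyword (case-insensitive)."""
    counts = {word: 0 for word in keywords}
    for path in Path(folder).glob("*.html"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        for word in keywords:
            if word in text:          # plain substring match, like a basic grep
                counts[word] += 1
    return counts


# Demo on a throwaway folder (stand-in for the downloaded ads).
with tempfile.TemporaryDirectory() as demo:
    Path(demo, "a.html").write_text("<title>Vendo Roland JD-XA</title>", encoding="utf-8")
    Path(demo, "b.html").write_text("<title>Compro LinnDrum</title>", encoding="utf-8")
    print(count_files_with_keyword(demo))
```

Because it matches substrings, a word such as `intercambio` would also count as `cambio`; matching whole words would need a regular expression with word boundaries.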

## 3.4 Elements of the ad that we are going to extract.

<br>

The next step is to extract the following fields from each ad:

- **description**

- **user**

- **price**

- **brand**

- **city** 

- **date published** 

- **date expire** 

- **times seen**

<br>


![hispa_4.png](images/hispa_4.png)



<br>

The image above shows an example ad with the fields we want to extract.

## 3.5 Extraction of the `action` and the `brand` name from the description.


<br>

### 3.5-1 Extraction of `action`

<br>


Extracting the contents of the fields is not very complicated, but the main description poses a problem: **how to tell apart a `sale`, a `purchase`, or an `exchange`**.


The strategy is to use a series of keywords as triggers of the **accion: *action*** whenever one of those words appears in the description of the ad.


This is the same thing we (humans) would do to see whether an ad is a sale or, on the contrary, a gift.


In [19]:
accion = ["compro","cambio","vendo","regalo","busco","busca",'reparar','piezas']

Once we have the `accion` keyword list, the next step is to turn each keyword into a trigger, that is, to make it fire a specific action.

<br>

We use the words in the `accion` list as the **keys** of a dictionary, and as the **value** of each key a **function to call for that action**.

<br>

    func_dict = {                      # the key give us the action (function)
        "compro":func_compro,
        "cambio":func_cambio,
        ...


    def func_compro(clave_func_dict):  # `compro` means the ad is not a sale, and so on...
        if list_compro[-1] == "0":
            list_vendo.pop(-1)
            list_vendo.append("0")
            list_compro.pop(-1)
            list_compro.append("1")

<br>
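
A runnable sketch of the dispatch-dictionary idea, under stated assumptions: instead of the `list_vendo` / `list_compro` lists, each ad is represented by a small dictionary of flags (a hypothetical layout), every ad is assumed to be a sale until a keyword says otherwise, and only three actions are wired in.

```python
import re


def func_compro(row):
    row["vendo"], row["compro"] = 0, 1   # a purchase is not a sale


def func_cambio(row):
    row["vendo"], row["cambio"] = 0, 1


def func_regalo(row):
    row["vendo"], row["regalo"] = 0, 1


# The key is the trigger word, the value is the function to call.
func_dict = {"compro": func_compro, "cambio": func_cambio, "regalo": func_regalo}


def classify(description):
    """Return the action flags for one ad description (hypothetical row layout)."""
    row = {"vendo": 1, "compro": 0, "cambio": 0, "regalo": 0}  # assumed default: sale
    for word in re.findall(r"[a-záéíóúñ]+", description.lower()):
        handler = func_dict.get(word)
        if handler:
            handler(row)
            break                        # the first trigger word decides
    return row


print(classify("Compro LinnDrum"))
# {'vendo': 0, 'compro': 1, 'cambio': 0, 'regalo': 0}
```

Adding a new action then only requires one more function and one more entry in `func_dict`.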

### 3.5-2 Extraction of the synthesizer brand name.

<br>


The next step is to collect all the synthesizer manufacturer brands that we might find in the ads.


An internet search turned up a list with a large number of them, at least as of the time of writing.


However, having worked on the project for some time, I already had a **list** of names, `sintes`, and at this point I realized that I had to modify it and merge it with the new list.

Link from which I obtained the synth brands: https://www.perfectcircuit.com/modular-synths

#### Synthesizer manufacturers.

In [20]:
sintes = ['coast 0_coast', '000', '4ms', 'avp a-v-p_synth', 'acces', 'access', 'digitakt elektron','voix la-voix-du-luthier', 'luthier la-voix-du-luthier', 
'oktatrak elektron','analog elektron','heat elektron','rythm elektron','digitone elektron','keys elektron','cycles elektron','samples elektron','acidlab', 'akai',
'mpc akai','alembic', 'alesis', 'allen allen_&_heath', 'analogaudio1', 'analogue solutions', 'analogue systems', 'arp', 'arturia', 'asm ashun_soundmachines', 
'atomo atomo_synth', 'damage audio_damage', 'audiophile audiophile_circuits', 'axoloty', 'balaguer', 'baloran', 'bastl bastl_instruments', 'befaco', 'behringer', 
'beringer', 'bheringer', 'bitbox', 'black corporation', 'boss', 'bubblesound instruments', 'buchla', 'casio', 'charlie lab', 'charvel', 'chronograf', 'circuit abbey', 'clavia', 
'club knobs', 'corsynth', 'cre8audio', 'crumar', 'custom made', 'cyclone', 'cyclone', 'dave jones', 'dave smith', 
'dave smith instruments', 'deepmind', 'delptronics', 'delta music', 'denon dj', 'dexibell', 'dexibell', 'digitack', 'doepfer', 
'dreadbox', 'dubreq', 'dynacord', 'e mu', 'e-mu', 'e-mu', 'e:m:c', 'elby designs', 'electribe', 'electronic music laboratories (eml)', 
'electrovoice', 'elektron', 'elka', 'emc', 'emu', 'endorphin.es', 'endorphines', 'ensoniq', 'eowave', 'epiphone', 'erica synth', 'erica synths', 
'ernie ball music man', 'esp ltd', 'eurorack', 'eventide', 'evh', 'evolver', 'exodus digital', 'farfisa', 'fender', 'fishman', 'fodera', 'formanta', 
'frap tool', 'frequency central', 'fretlight', 'friedman', 'future retro', 'futuresonus', 'gator', 'gemini', 'generalmusic', 'gibson', 'godin', 'gotharman', 
'graph tech', 'gretsch', 'guild', 'hammond', 'hartmann', 'hexinverter', 'hinton instruments', 'hofner', 'hypersynth', 'ibanez', 'ik', 'instruo', 'iomega', 
'isla', 'jackson', 'jaspers', 'john bowen synth design', 'jomox', 'kawai', 'kenton', 'ketron', 'kilpatrick audio', 'knobula', 
'koma elektronik', 'komplete', 'korg', 'kramer', 'kurzweil', 'kurzweil', 'lakland', 'line 6', 'linn electronics', 'livid', 'logan electronics', 'm-audio', 
'macbeth studio systems', 'make', 'malekko', 'manikin electronic', 'maschine', 'mellotron', 'mfb', 'micro modular', 'miditech', 
'models', 'modor', 'modular', 'modulus', 'monome', 'moog', 'mpc', 'mpc', 'mutable instruments', 'mutant', 'native instruments', 'neutron', 'noise engineering', 
'nord', 'nord electro', 'nord lead 2 rack', 'nord lead 3', 'nord lead 3', 'nord lead 4', 'nord micro modular', 'nord modular', 'nord rack', 'nord stage', 
'nord wave', 'novation', 'numark', 'oberheim', 'octatrack', 'orthogonal devices', 'paratek', 'pearl', 'peavey', 'pioneer dj', 'pittsburgh', 'pittsburgh modular', 
'polyend', 'polygraf', 'ppg (palm products gmbh)', 'prs', 'qu bit', 'qu-bit', 'qu-bit electronix', 'quasimidi', 'qubit', 'quiklok', 'radikal technologies', 
'rhodes', 'rickenbacker', 'roland', 'roli', 'sanson', 'schecter', 'sensel', 'sequencial', 'sequential circuits', 'sequentix', 'shakmat', 'simmons', 'soma', 'sonicware', 'special waves', 'spector', 'spectral audio', 'sputnik', 'squarp instruments', 
'squier', 'ssff', 'stanton', 'steinberger', 'sterling', 'strymon', 'studio electronics', 'synamodec', 'synthesis technology', 
'synthrotek', 'synthstrom', 'synthstrom', 'synthtech','swissonic', 'tascam', 'taylor', 'technos', 'teenage', 'teenage engineering', 'tiptop', 'tiptop audio', 
'traveler guitar', 'udo audio', 'uno synth ', 'vermona', 'vermona', 'virus', 'viscount', 'volca', 'vox', 'waldorf', 'warwick', 'washburn', 'waves grendel', 
'wersi', 'wersi music', 'winter modular', 'wmd', 'wmd / ssf', 'wurlitzer', 'yamaha', 'yocto', 'zeppelin design labs', 'zoom','1010 music', '2hp', '4ms', 'acid rain technology', 
'acl', 'addac system', 'after later audio', 'aion modular', 'ajh synth', 'cosmos soma', 'divina soma', 'enner soma', 'ether soma', 'flux soma', 'illuminator soma', 
'lyra-8 soma', 'lyra8-fx soma', 'metaconformer soma', 'ornament-8 soma', 'pulsar-23 soma', 'qo soma', 'reflex soma', 'roat soma', 'terra soma', 'the pipe soma',
'alm busy circuits', 'alright devices', 'analogue solutions', 'bastl instruments', 'befaco', 'blackhole cases', 'blue lantern', 'boredbrain music', 
'bubblesound', 'buchla', 'cosmotronic', 'cre8audio', 'divkid', 'dnipro modular', 'doepfer', 'dreadbox', 'e-rm','LinnDrum','Linn Electronics', 'electrosmith', 'emblematic systems', 
'empress effects', 'endorphin.es', 'eowave', 'erica synths', 'erogenous tones', 'eskatonic modular', 'eventide', 'five12', 'frap tools', 'future sound systems', 
'gieskes', 'grayscale', 'hexinverter', 'industrial music electronics','voix du luthier', 'instruo', 'io instruments', 'jomox', 'joranalogue', 'klavis', 
'koma elektronik', 'l-1', 'lmntl', 'low-gain electronics', 'lzx industries', 'make noise','eloquencer', 'malekko heavy industry', 'manhattan analog', 'meng qi', 
'michigan synth works', 'modbap modular', 'moog', 'mordax', 'mosaic', 'mrseri', 'mutable instruments', 'nano modules', 'noise engineering', 
'patching panda', 'percussa', 'pittsburgh modular', 'plankton electronics', 'poly effects', 'qu-bit electronix', 'random source', 'ritual electronics', 
'rossum', 'schlappi engineering', 'shakmat modular', 'soundforce', 'soundmachines', 'squarp', 'steady state fate', 'strymon', 'studio electronics', 
'supercritical', 'synthesis technology', 'system 80', 'tall dog electronics', 'tasty chips', 'tenderfoot electronics', 'tesseract modular', 'tiptop audio', 
'trogotronic', 'tubbutec', 'u-he', 'verbos electronics', 'vermona', 'voicas', 'vpme.de', 'winter modular', 'wmd', 'worng electronics', 'xaoc devices', 
'xor electronics', 'zlob modular',"ASM","Elektron","Moog","Teenage Engineering","Korg","Novation","Modal Electronics",
"Black Corporation","Roland","Arturia","Critter & Guitari","Polyend","UDO","Waldorf","Nord","Yamaha","Vermona","Crumar","JMT Synth","Modor",
"Studio Electronics","Trogotronic","Gieskes","Akai","Dreadbox","Herbs and Stones","IK Multimedia","Tasty Chips","Buchla","Soundmachines",
"Access","Grp","Analogue Solutions","The Division Department","Norand","Jomox","Sonicware","Radikal Technologies","Playtime Engineering",
"1010 Music","Fred's Lab","Kilpatrick Audio","Eowave","Electrosmith","Meng Qi","Studiologic","Suzuki","Nonlinear Labs","Dato","Artiphon",
"Malekko Heavy Industry","Kodamo","Hikari Instruments","Manikin Electronic","Second Sound","Arturia","Squarp","Polyend","Novation","Akai",
"Roger Linn Design","Conductive Labs","Native Instruments","Faderfox","Sensel","Roland","Keith McMillen","Pioneer","E-RM","Expressive E","Korg",
"M-Audio","Alesis","JouÃ©","Soundforce","Yamaha","Genki","Erica Synths","Make Noise","Doepfer","Elektron","Moog",
"Teenage Engineering","1010 Music","Expert Sleepers","BASTL Instruments","Kenton","Circuit Happy","MOTU","MIDI Solutions","Solid State Logic",
"Nord","Malekko Heavy Industry","Koma Elektronik","Random Source","Eowave","Zoom","Crumar","Electro-Harmonix","Grp","Michigan Synth Works",
"Analogue Solutions","Knas","iConnectivity","Soundmachines","Eurodesk-Z","Presonus","Torso Electronics","IK Multimedia","ESI Audiotechnik",
"Low-Gain Electronics","Artiphon","Instruments of Things","Apogee","SND","Moffenzeef","CME","Embodme",
"Tech 21","Snyderphonics","Tricks Magic Shop","Strymon","Vermona","OTO Machines","Dreadbox","Chase Bliss Audio","Boss","GFI","Meris",
"Eventide","SOMA Laboratory","Echo Fix","Fairfield Circuitry","Universal Audio","Gamechanger Audio","EarthQuaker Devices","Death By Audio",
"Sherman","Electro-Harmonix","Old Blood Noise Endeavors","Knas","Red Panda","Malekko Heavy Industry","Kemper","DigiTech","JAM Pedals",
"Erica Synths","Elektron","WMD","1010 Music","Roland","Korg","Poly Effects","Jomox","Thermionic Culture","Warm Audio",
"Zoom","Boredbrain Music","Meng Qi","Electrosmith","Benidub","BAE","Trogotronic","MIDI Solutions","Plankton Electronics","Vongon",
"ART","Hungry Robot","Walrus Audio","Enjoy Electronics","CIOKS","TK Audio","Source Audio","API","Voodoo Lab",
"FMR Audio","JHS Pedals","MOD Devices","Cooper FX","Finegear","Ezhi & Aka","Truetone","LastGasp Art Laboratories","Origin Effects",
"Rainger FX","Line 6","PedalTrain","Dr. Scientist","Elta Music","Keeley","Recovery","Glou-Glou","Retro Mechanical Labs","Electro-Faustus",
"Animal Factory","Hologram","Caroline Guitar Company","MXR","Second Sound","Xotic","Dunlop","Adventure Audio","ISP Technologies",
"Industrialectric","Tech 21","Collision Devices","Orgeldream","Universal Audio","API","Solid State Logic","Rupert Neve Designs","Shure","MOTU",
"Warm Audio","Focusrite","Vermona","Focal","Neumann","Roland","Thermionic Culture","Arturia","Zoom","Presonus","Adam","ART","Yamaha",
"TASCAM","Furman","Antelope Audio","Dangerous Music","Pioneer","Echo Fix","Native Instruments","Eventide","Allen & Heath",
"Meris","Sherman","dbx","BAE","Maag Audio","Empirical Labs","Avantone Pro","iConnectivity","Mackie","Audient","beyerdynamic","TK Audio",
"IK Multimedia","Black Lion Audio","RME","Keith McMillen","Golden Age Project","Audio-Technica","Fredenstein","A-Designs","Rosson Audio",
"Daking","Looptrotter","Rode","Prism Sound","Samson","Cranborne Audio","ESI Audiotechnik","Elysia","HEDD","FMR Audio","Heritage Audio",
"Avedis Audio","Sennheiser","Lindell Audio","Blue Microphones","Apogee","Recovery","M-Audio","Zeppelin Design Labs","KRK","AKG",
"Cloud Microphones","Steinberg","Alesis","Dynaudio","Austrian Audio","Auralex","IsoAcoustics","Aston Microphones","Auratone","sE Electronics","SE Electronics",
"Tech 21","Lauten Audio","Cascade Microphones","Soundrise","Pioneer","Allen & Heath","Pro-Ject","PLAYdifferently","U-Turn Audio","Audio-Technica",
"Thorens","Audioengine","Technics","Rane","AKG","Music Hall","Native Instruments","Numark","Sennheiser","Jesse Dean Designs","Ortofon",
"Rosson Audio","MWM","Gator","IK Multimedia","ART","Yamaha","Ultimate Support","RME","Roland","KRK","Austrian Audio","Shure","Odyssey",
"Teenage Engineering","Denon","Record Props","Presonus","Hosa","Hosa","Mogami ","Roland ","Voodoo Lab ","CIOKS ","LMNTL ","Warm Audio "
"Teenage Engineering ","myVolts ","Gator ","Truetone ","Strymon ","Eurodesk-Z","Furman","Elektron","Tiptop Audio","Retrokits","4MS","EBS",
"Pomona Electronics","Modbang","Intellijel Designs","Plankton Electronics","Radial Engineering","1010 Music","Native Instruments",
"Expert Sleepers","Buchla","iConnectivity","Modbap Modular","Boredbrain Music","Make Noise","Korg","Moog ","Rode ","Shure ",
"LabLab Audio ","Zoom ","Doepfer ","Koma Elektronik ","ADDAC System ","Frap Tools ","Endorphin.es ","ART s","Yamaha ","Walrus Audio",
"ALM Busy Circuits ","Analogue Solutions ","Trogotronic ","Befaco ","Boss ","Soundmachines ","LZX Industries ","Cyclone Analogic ",
"M-Audio ","E-RM ","Pulp Logic ","Electro-Harmonix ","ESI Audiotechnik ","Eskatonic Modular ","Eventide ","Instruo ","Keith McMillen",
"Malekko Heavy Industry ","Dunlop"]

In [21]:
# cleaning names in sintes list.

lista_criba = []

for marca in sintes:
    # Switch to lowercase
    marca = marca.lower()
    if marca not in lista_criba:
        # Filter out the repeated names and put them in another list
        lista_criba.append(marca)

In [22]:
lista_criba[2:10]

['4ms',
 'avp a-v-p_synth',
 'acces',
 'access',
 'digitakt elektron',
 'voix la-voix-du-luthier',
 'luthier la-voix-du-luthier',
 'oktatrak elektron']
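
Some names in `sintes` carry a trailing space (for example `"Roland "`), so lowercasing alone does not merge them with `"Roland"`. A minimal sketch of a variant that also trims whitespace; `clean_names` is a hypothetical helper, and `dict.fromkeys` is used because it keeps the first-seen order while dropping repeats.

```python
def clean_names(names):
    """Lowercase, trim, and drop repeated names while keeping first-seen order."""
    cleaned = (name.strip().lower() for name in names)
    return list(dict.fromkeys(cleaned))


print(clean_names(["Roland", "roland ", "Korg", "korg", "Make Noise"]))
# ['roland', 'korg', 'make noise']
```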

- Once the list is free of repeated names, the next step is to **split each name into a list of words**, so that names composed of two terms become a list of two elements.

In [23]:
# split double names in sintes list

lista_sintes= []

for marca in lista_criba:
    marca = marca.lower()
    if marca not in lista_sintes:
        marca=marca.split()
        lista_sintes.append(marca)

In [24]:
lista_sintes[2:10]

[['4ms'],
 ['avp', 'a-v-p_synth'],
 ['acces'],
 ['access'],
 ['digitakt', 'elektron'],
 ['voix', 'la-voix-du-luthier'],
 ['luthier', 'la-voix-du-luthier'],
 ['oktatrak', 'elektron']]

## 3.6 How to identify manufacturer brands?

<br>

Throughout this process, one of the things I wanted to implement was the ability to correctly read and identify the brand names of synthesizer manufacturers in the description of the ad.



If we rely solely on a dictionary lookup to obtain the manufacturer's name, a clear issue arises when two manufacturers share the first part of their name, like:
<br>

    ...analogue systems, analogue solutions...
    
<br>

The selection would then depend not on the correct name but on the position of that name within the dictionary. To solve this problem we have to:


- Build a manufacturers' dictionary.

- Implement an algorithm that differentiates between manufacturer brands that share a first word.

<br>

As an example we will use this small dictionary as if it were our dictionary of synthesizer manufacturers:

  - Manufacturers' dictionary:


    sint3 = {"analogue":["solutions","systems"]} 
    
<br>


The implementation of the algorithm will be based on detecting:

- ***single** names*, like `Roland`

- ***double** and unique names*, like `Dave Smith`

- ***double** names with a **first word shared** by different manufacturers*, like `Analogue Systems` or `Analogue Solutions`




 
<br>

**This** particular example lets us see one of the trickiest cases when it comes to extracting a name.

---

### Example of an implementation of the algorithm that detects brand names in the description:

The following examples are only meant to show how different types of `ad description` are read:

- `Analogue Systems`

- ![analogue-systems](gif/analogue-systems.gif)

<br>

- `Analogue Solutions`

![analogue-solutions](gif/analogue-solutions.gif)


<br>

- `Doepfer`

- ![doepfer](gif/doepfer.gif)
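
The reading process shown in the animations can be sketched as follows. This is a simplified illustration, not the notebook's final code: `detect_brand` is a hypothetical function, the `brands` dictionary is a hand-made sample in the same shape as `sint3`, and single-word brands are assumed to be stored as a key that points to itself.

```python
def detect_brand(description, brands):
    """Return the first brand found in the description, or None."""
    words = description.lower().split()
    for i, word in enumerate(words):
        if word not in brands:
            continue
        second_words = brands[word]
        if second_words == [word]:            # single name, e.g. 'doepfer'
            return word
        # double name: the next word decides which manufacturer it is
        next_word = words[i + 1] if i + 1 < len(words) else None
        if next_word in second_words:
            return f"{word} {next_word}"
    return None


brands = {
    "analogue": ["solutions", "systems"],   # shared first word
    "dave": ["smith"],                      # double and unique
    "doepfer": ["doepfer"],                 # single
}

print(detect_brand("vendo analogue systems rs-95", brands))   # analogue systems
print(detect_brand("vendo doepfer a-100", brands))            # doepfer
print(detect_brand("analogue delay pedal", brands))           # None
```

Looking at the word that follows the shared first word is what removes the dependency on the order of the names inside the dictionary.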






In the previous example we used a small dictionary that we could write by hand, but we need to build a dictionary with all the manufacturers.


## Building the manufacturer's dictionary

In [25]:
marcas_nombres = []

def sint_word(sintex):
    marcas_nombres.append(sintex)
    return [marcas_nombres[-1]]

def sint_more_word_rep(sintex):
    marcas_nombres.append(sintex)
    return marcas_nombres[-1]


dict_funct = {"sint_word":sint_word,
            "sint_more_word_rep":sint_more_word_rep
}

dict_marca = {}
tag_mark = ''

for marcas in lista_sintes:
    if len(marcas) == 1:
        if marcas[0] not in dict_marca:
            tag_mark = 'sint_word'
            brand = marcas[0]
            ret = dict_funct[tag_mark](brand)
            
            dict_marca[brand] = ret
            #print("x")
            
    elif len(marcas) > 1:                           # here the brand has this format: ['coast', '0_coast']
        if marcas[0] not in dict_marca:
            tag_mark = 'sint_word'
            #print(marcas[0])
            #print(marcas[1])

           
            ret = dict_funct[tag_mark](marcas[1])
            dict_marca[marcas[0]] = ret

        elif marcas[0] in dict_marca:
            tag_mark = 'sint_more_word_rep'
            ret = dict_funct[tag_mark](marcas[1])
            dict_marca[marcas[0]].append(ret)
            #print("x")
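The same grouping idea can be sketched more compactly with `collections.defaultdict`. This is an alternative illustration, not the code used in the notebook: `build_brand_dict` and `sample` are hypothetical names, the input is assumed to be a list of word lists like `lista_sintes`, and unlike the loop above it skips repeated second words.

```python
from collections import defaultdict

def build_brand_dict(brand_word_lists):
    """Group brands by their first word: first word -> list of second words."""
    brand_dict = defaultdict(list)
    for words in brand_word_lists:
        if not words:
            continue
        first = words[0]
        # Single-word brands are stored under their own name
        second = words[1] if len(words) > 1 else words[0]
        if second not in brand_dict[first]:
            brand_dict[first].append(second)
    return dict(brand_dict)

# Assumed sample input, in the same shape as ['0', 'coast']
sample = [
    ["roland"],
    ["analogue", "solutions"],
    ["analogue", "systems"],
    ["analogue", "solutions"],
]
demo_dict = build_brand_dict(sample)
```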

In [26]:
print(dict_marca)

{'coast': ['0_coast'], '000': ['000'], '4ms': ['4ms'], 'avp': ['a-v-p_synth'], 'acces': ['acces'], 'access': ['access'], 'digitakt': ['elektron'], 'voix': ['la-voix-du-luthier', 'du'], 'luthier': ['la-voix-du-luthier'], 'oktatrak': ['elektron'], 'analog': ['elektron'], 'heat': ['elektron'], 'rythm': ['elektron'], 'digitone': ['elektron'], 'keys': ['elektron'], 'cycles': ['elektron'], 'samples': ['elektron'], 'acidlab': ['acidlab'], 'akai': ['akai'], 'mpc': ['akai'], 'alembic': ['alembic'], 'alesis': ['alesis'], 'allen': ['allen_&_heath', '&'], 'analogaudio1': ['analogaudio1'], 'analogue': ['solutions', 'systems', 'solutions'], 'arp': ['arp'], 'arturia': ['arturia'], 'asm': ['ashun_soundmachines'], 'atomo': ['atomo_synth'], 'damage': ['audio_damage'], 'audiophile': ['audiophile_circuits'], 'axoloty': ['axoloty'], 'balaguer': ['balaguer'], 'baloran': ['baloran'], 'bastl': ['bastl_instruments', 'instruments'], 'befaco': ['befaco'], 'behringer': ['behringer'], 'beringer': ['beringer'], 'bh

It will give us:

<br>

- The content of `sint3[idx]`, `['solutions', 'systems']`, shown thanks to a print.


- Finally, we get the right name from the description, as we expected.

<br>

Now that we understand how it works, we can implement the necessary code.


## 3.7 Detecting manufacturer brands.



In [30]:
compare = ''                    #variable where the middle name is saved
marca_del_sinte = ''            # empty variable for store synth brand 
texto_descriptivo = ''          #ad descriptive text
list_temp = []                  #temporary list to detect the middle name 

                                # buy, sell, change... lists.
list_compro = []
list_cambio = []
list_vendo = []
list_regalo = []
list_busco = []
list_rebaja = []
list_reparar = []
list_piezas = []
list_urgente = []
list_oferta = []

list_brand = []                 # manufacturers synth brand
list_descripcion = []           # final ad description on dataframe output 
texto_descriptivo_salida = []   # this is the content of the ad

list_price = []                 # price
list_user = []                  # user
list_city = []                  # city
list_published = []             # date published
list_expire = []                # data expire ad
list_times_seen= []             # times seen ad

list_original=[]

lista_palabras_para_eliminar = [] # Words that must be removed from the ad description: actions and synth brand.

def func_compro(clave_func_dict): 
    if list_compro[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_compro.pop(-1)
        list_compro.append("1")
    else:
        pass
    
def func_cambio(clave_func_dict):
    if list_cambio[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_cambio.pop(-1)
        list_cambio.append("1")
        #list_price.append("0")
    else:
        pass

def func_vendo(clave_func_dict):
    if list_vendo[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("1")
    else:
        pass

def func_regalo(clave_func_dict): 
    if list_regalo[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_regalo.pop(-1)
        list_regalo.append("1")
    else:
        pass

def func_busco(clave_func_dict):  # if looking for, then is not a sell...
    if list_busco[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_busco.pop(-1)
        list_busco.append("1")
    else:
        pass

def func_reparar(clave_func_dict):
    if list_reparar[-1] == "0":
        list_reparar.pop(-1)
        list_reparar.append("1")
    else:
        pass

def func_piezas(clave_func_dict):
    if list_piezas[-1] == "0":
        list_piezas.pop(-1)
        list_piezas.append("1")
    else:
        pass

def func_rebaja(clave_func_dict):
    if list_rebaja[-1] == "0":
        list_vendo.pop(-1)
        list_vendo.append("0")
        list_rebaja.pop(-1)
        list_rebaja.append("1")
    else:
        pass

def func_oferta(clave_func_dict):
    if list_oferta[-1] == "0":
        list_oferta.pop(-1)
        list_oferta.append("1")
    else:
        pass

func_dict = {                                                        # function dictionary
    "compro":func_compro,
    "cambio":func_cambio,
    "vendo":func_vendo,
    "vende":func_vendo,
    "regalo":func_regalo,
    "busco":func_busco,
    "busca":func_busco,
    "reparar":func_reparar,
    "piezas":func_piezas,
    "rebajado":func_rebaja,
    "rebaja":func_rebaja,
    "oferta":func_oferta
}

def remove_compro(clave_func_dict):
    #list_compro.append(clave_func_dict
    list_compro.remove(clave_func_dict)

#rmv_func = {"compro":remove_compro}


def urgente():                                                       # if an "accion" word is repeated in the description, it means urgency
    list_urgente[-1] = "1"                                           # flag only the current (last) ad

def eliminar_signos(txt): 
    # cleaning text
    txt = txt.lower()
    description = txt.replace(":"," ")
    descripcion = description.replace(";"," ")
    descripcion_1 = descripcion.replace("("," ")
    descripcion_2 = descripcion_1.replace(")"," ")
    descripcion_3 = descripcion_2.replace("/"," ")
    descripcion_4 = descripcion_3.replace("."," ")
    descripcion_5 = descripcion_4.split()
    return descripcion_5

def default_atributes():                                            # default flags: every ad is assumed to be a sale unless an action function says otherwise
    """
    Appends the default value for the current ad (row) to each of the working lists.
    """
    list_cambio.append("0")
    list_compro.append("0")
    list_urgente.append("0")
    list_vendo.append("1")
    list_regalo.append("0")
    list_reparar.append("0")
    list_piezas.append("0")
    list_busco.append("0")
    list_rebaja.append("0")                                         # needed so func_rebaja can read list_rebaja[-1]
    list_oferta.append("0")                                         # needed so func_oferta can read list_oferta[-1]
    list_brand.append("-")

### Start


for pagina_anuncio in os.listdir('.'): # Loop over every file in the current "." folder
    with open(pagina_anuncio, 'r') as pagina_bruto:
        pagina_analizar = pagina_bruto.read()                        # Read the whole file into 'pagina_analizar'
        soup = BeautifulSoup(pagina_analizar, 'html.parser')         # Parse the page with 'html.parser' and keep the result in 'soup'
        node = soup.find('h1')                                       # Find the first <h1> tag in 'soup' and store it in 'node'

    if  node is not None:                                            # Skip pages without an <h1>, which would otherwise raise an error on None
        descripcion = node.text                                      # Extract the text of 'node' into 'descripcion'
        descripcion = eliminar_signos(descripcion)                   # Lowercase the text, replace punctuation (: ; ( ) / .) with spaces and split it into words
                             

        default_atributes()                                          # Append the default values for this ad
        
        # --- synt_brand

        for word_1 in descripcion:                                  
            if word_1 in accion:                                     
                func_dict[word_1](word_1)                           
                lista_palabras_para_eliminar.append(word_1)

            elif word_1 in compare:   
                list_temp.append(word_1)                            

                for marca_sinte in list_temp:                       
                    marca_del_sinte += marca_sinte + ' '             
                    lista_palabras_para_eliminar.append(marca_sinte) 

                list_brand.pop(-1)
                list_brand.append(marca_del_sinte)

                compare = '' 

            elif word_1 in dict_marca:                        
                size_brand = len(dict_marca[word_1])

                if ((size_brand == 1) and (list_brand != "-")) :
                    list_brand.pop(-1)
                    list_brand.append(word_1)
                    break

                elif ((size_brand == 1) and (list_brand == "-")) :
                    list_descripcion.append(word_1)

                if ((size_brand >= 1) and (list_brand == "-") and (size_brand != 0)) :  
                    list_descripcion.append(word_1) 
                    list_brand.pop(-1) 
                    x = dict_marca[word_1]
                    list_brand.append(x)
                    break

                elif size_brand >= 1:                               
                    compare = dict_marca[word_1]                    
                    list_temp.append(word_1)                         

                elif list_brand != "-":                              
                    list_descripcion.append(word_1)

            marca_del_sinte = ''
        list_temp.clear()
        
        # --- urgente
        
        duplicates = [element for element in lista_palabras_para_eliminar if lista_palabras_para_eliminar.count(element) > 1] # Collect every word that appears more than once in 'lista_palabras_para_eliminar'
        unique_duplicates = list(set(duplicates))                                                                             # The duplicated words, without repetitions
        size_unique_duplicates = len(duplicates)                                                                              # Total number of repeated occurrences (counted on 'duplicates')
        if size_unique_duplicates > 3:                                                                                        # If there are more than 3 repeated occurrences, call urgente()
            urgente()                                                                                                         # Writes a 1 in the 'urgent' column

        for eliminar in lista_palabras_para_eliminar:
            try:
                descripcion.remove(eliminar)                # Remove the identified actions and the synth's brand from the ad description
            except ValueError:                              # The word is no longer in the description
                pass

            
        for palabras in descripcion:                       # Join what remains of the description into 'texto_descriptivo'
            texto_descriptivo += palabras + ' '

        texto_descriptivo_salida.append(texto_descriptivo) # 'texto_descriptivo' is the text that ends up as the description in the final csv

        texto_descriptivo =''                              # Reset 'texto_descriptivo' for the next ad


        # --- price
        
        try:
            # Try to find the element with the 'ad-price' class and extract the text
            price = soup.find('div', class_='ad-price').text
            # Remove the € symbol from the price text
            price = price.replace("€", "")
            
        except AttributeError:
            # If the element is missing, set the price to 0
            price = 0
        
        finally:
            # Add the price value (either the found price or 0) to list_price
            list_price.append(price)
        

        # --- user name

        user = soup.find('div',class_='col-lg-7').a.text
        list_user.append(user)
        

        # --- city

        city = soup.find('div',class_='col-lg-7').div.strong.text
        list_city.append(city)

    
        # --- published

        publish = ' '

        try:
            # Find the div element with the class 'col-lg-7' and extract the text from the inner div
            published = soup.find('div', class_='col-lg-7').div.text.split()[-5:-2]

            for indx in published:
                list_original.append(indx)  # Add indx to the list list_original

                # Check to see if there's a forward slash on the item
                if '/' in indx:
                    # indx = indx.replace("/", "-")  # Replaces "/" with "-"
                    DD = indx[0:2]  # Extract the first two characters (day)
                    MM = indx[3:5]  # Extract the next two characters (month)
                    YYYY = indx[6:]  # Extract the remaining characters (year)
                    publish = f'{YYYY}/{MM}/{DD}'  # Create the date string in YYYY/MM/DD format
                    #print("YYYY", publish)
                    

                # If "hace" is in the element, it means that it is no longer a date that is extracted, but the reference to how long ago.
                elif 'hace' in indx:
                    #indx = indx.replace("/", "-")  # Replace "/" with "-"
                    a = published.index(indx)  # Gets the index of the current item

                    # Combine the numerical value and the unit of time (2 horas, 5 días, 2 semanas...)
                    publish = published[a + 1] + ' ' + published[a + 2] # <- This yields the format "1 semana" or "19 horas"...
                    # The raw 'publish' text goes into the dataframe; it is converted to a proper date later
                    #print("publish", publish)
                    

        except (AttributeError, IndexError):
            # If exceptions occur due to attribute or index issues, leave publish as a blank string
            publish = " "

        finally:
            # Add the final value of "publish" to the list list_published
            list_published.append(publish)

    
        
        # --- expire 

        expire = soup.find('div',class_="expira").text.split()[1]
        #expire = expire.replace("/","-")
        DD = expire[0:2]
        MM = expire[3:5]
        YYYY = expire[6:]
        date_corrected = f'{YYYY}-{MM}-{DD}'
        list_expire.append(date_corrected)
        
        

        # --- times seen
        
        seen = soup.find('div',class_="expira").text.split()[4]
        list_times_seen.append(seen)

        lista_palabras_para_eliminar.clear()
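The flag logic above (every ad starts as a sale, and action words switch columns on or off) can also be sketched with one dictionary per ad instead of parallel lists. This is a hedged alternative, not the notebook's implementation: `ACTION_COLUMNS`, `NOT_A_SALE` and `classify_ad` are hypothetical names, and only the actions that end up as dataframe columns are included.

```python
# Hypothetical keyword -> column mapping, mirroring the idea behind func_dict.
ACTION_COLUMNS = {
    "compro": "buy",
    "cambio": "change",
    "vendo": "sell",
    "vende": "sell",
    "regalo": "gift",
    "busco": "search",
    "busca": "search",
    "reparar": "repair",
    "piezas": "parts",
}

# Actions that mean the ad is not a plain sale.
NOT_A_SALE = {"buy", "change", "gift", "search"}

def classify_ad(words):
    """Return a dict of "0"/"1" flags for one tokenized ad title."""
    flags = {column: "0" for column in ACTION_COLUMNS.values()}
    flags["sell"] = "1"                      # default: every ad is a sale
    for word in words:
        column = ACTION_COLUMNS.get(word)
        if column is None:
            continue                         # not an action word
        flags[column] = "1"
        if column in NOT_A_SALE:
            flags["sell"] = "0"
    return flags
```

For example, `classify_ad(["busco", "roland", "juno"])` marks `search` and clears `sell`, while a title with no action word keeps the default sale flag.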

### About BeautifulSoup Warning:

The Launchpad bug report linked below is marked as a duplicate of Bug #1873787, "Suppress UserWarning * looks like a URL":

https://bugs.launchpad.net/beautifulsoup/+bug/1955450

    ...Beautiful Soup generally takes the approach of trying to give "helpful" error/warning codes so that a user understands why things are not working the way they expect. While every developer may have a different opinion on how helpful error/warnings should be done, Beautiful Soup has taken a more ambitious approach...

## 3.8 Date extraction

The next step is to record the extraction date. This matters because it is the reference needed to work out what "3 days", "1 week" or "5 hours" ago means relative to the moment the records were collected.

In [31]:
hoy = dt.datetime.now()
year=str(hoy.year)
month=str(hoy.month)
day=str(hoy.day)
date_scrapped = hoy.strftime('%Y/%m/%d')   # zero-padded, same format as the other date columns

- Dataframe created in `df` variable.

In [32]:
df = pd.DataFrame({'urgent':list_urgente,
                   'buy':list_compro,
                   'change':list_cambio,
                   'sell':list_vendo,
                   'price':list_price,
                   'gift':list_regalo,
                   'search':list_busco,
                   'repair':list_reparar,
                   'parts':list_piezas,
                   'synt_brand':list_brand,
                   'description':texto_descriptivo_salida,
                   'city':list_city,
                   'published':list_published,
                   'expire':list_expire,
                   'date_scrapped':date_scrapped,
                   'seen':list_times_seen
                  },index = list(range(1,len(texto_descriptivo_salida)+1)))

## 3.9 Clean the column of the publication dates.


As we can see, sometimes the format is correct and sometimes the value is a time interval relative to the current date, so it has to be corrected.

The solution is to create a function that reads that format and converts it to the correct date and format.


To do this we have to cover all the cases that can occur.

In [33]:
semanas = ['1 semana', '2 semanas', '3 semanas', '4 semanas']
dias = ['1 día', '2 días', '3 días', '4 días', '5 días', '6 días', '7 días']
horas = ['1 hora','2 horas', '3 horas', '4 horas', '5 horas', '6 horas',
        '7 horas','8 horas', '9 horas', '10 horas', '11 horas', '12 horas',
        '13 horas', '14 horas','15 horas', '16 horas', '17 horas', '18 horas',
        '19 horas', '20 horas', '21 horas', '22 horas','23 horas', '24 horas']

In [34]:
minutes=[]
for mint in range(1,61):
    if mint < 2:
        texto = str(mint) + ' minuto'
        minutes.append(texto)
    else:
        texto = str(mint) + ' minutos'
        minutes.append(texto)

- `nice_format` is the function that is responsible for identifying the time intervals that the web gives us and making a time conversion.

In [35]:
def nice_format(parameter):

    days_inweek = 7
    hoy = dt.datetime.now()
    year=str(hoy.year)
    month=str(hoy.month)
    day=str(hoy.day)

    date_scrapped = year + '/' + month + '/' + day
    
    current_datetime = dt.datetime.strptime(date_scrapped,"%Y/%m/%d")  
    
    
    if parameter in semanas:
    
        num_semana = parameter.split()
        num_semana = int(num_semana[0])
        cambio_semana = semanas[num_semana-1]
        
        dias_semana = (num_semana * days_inweek)
        
        fecha_real_semana = current_datetime - dt.timedelta(dias_semana)
        
        fecha_real_semana = fecha_real_semana.strftime("%Y/%m/%d")
                
        df['published'] = df['published'].replace( to_replace = cambio_semana, value = fecha_real_semana) #+ ' semana'
        
        
    if parameter in dias:
        num_dia = parameter.split()
        num_dias = int(num_dia[0])
        cambio_dia = dias[num_dias-1]

        fecha_real_dia = current_datetime - dt.timedelta(num_dias)
        fecha_real_dia = fecha_real_dia.strftime("%Y/%m/%d")
        
        df['published'] = df['published'].replace( to_replace = cambio_dia, value = fecha_real_dia) #+ ' semana'
        
        
    if parameter in horas:
        num_hora = parameter.split()
        num_hora = int(num_hora[0])
        
        if (parameter != '24 horas'):
            hora_real = current_datetime
            hora_real = hora_real.strftime("%Y/%m/%d")
            
            df['published'] = df['published'].replace(to_replace = parameter,
                                              value = hora_real)
        
        elif parameter == '24 horas':
            horas_24 = 1
            hora_real = current_datetime - dt.timedelta(horas_24)
            hora_real = hora_real.strftime("%Y/%m/%d")
            
            df['published'] = df['published'].replace( to_replace = parameter,value = hora_real ) #+ ' semana'
    
    
    if parameter in minutes:
        hora_real = current_datetime                                   # published minutes ago: same day as the extraction
        hora_real = hora_real.strftime("%Y/%m/%d")
            
        df['published'] = df['published'].replace( to_replace = parameter,value = hora_real )
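As a point of comparison, the conversion can also be sketched as a function that returns a value instead of rewriting the column, using a regular expression rather than lists of every possible text. This is an assumption-laden sketch: `relative_to_date` and `UNIT_TO_TIMEDELTA` are hypothetical names, and it subtracts the real elapsed time, so its result can differ from `nice_format` for ads published a few hours ago around midnight.

```python
import datetime as dt
import re

# Hypothetical helper table: Spanish unit word -> timedelta keyword.
UNIT_TO_TIMEDELTA = {
    "minuto": "minutes",
    "hora": "hours",
    "día": "days",
    "semana": "weeks",
}

def relative_to_date(text, now=None):
    """Convert '3 días', '1 semana', '19 horas'... to 'YYYY/MM/DD'.

    Anything that does not look like a relative interval is returned unchanged.
    """
    now = now or dt.datetime.now()
    match = re.fullmatch(r"(\d+)\s+(minuto|hora|día|semana)s?", text.strip())
    if match is None:
        return text
    amount, unit = int(match.group(1)), match.group(2)
    delta = dt.timedelta(**{UNIT_TO_TIMEDELTA[unit]: amount})
    return (now - delta).strftime("%Y/%m/%d")
```

Under these assumptions it could be applied with something like `df['published'] = df['published'].map(relative_to_date)`.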

In [36]:
df['published'].apply(nice_format) # nice_format rewrites df['published'] for every value it recognizes
print('')                          # Avoid verbose output

df['expire'] = df['expire'].str.replace('-', '/')
df['date_scrapped'] = df['date_scrapped'].str.replace('-', '/')

# Function to convert lists to text strings
def convert_to_string(lista):
    return str(lista)

# Function to modify the punctuation marks of a series
def remove_punctuation_marks(serie):
    serie = serie.str.replace(r',', '', regex=True)
    serie = serie.str.replace(r'\[', '', regex=True)
    serie = serie.str.replace(r'\]', '', regex=True)
    serie = serie.str.replace(r'\'', '', regex=True)
    return serie

df['synt_brand'] = df['synt_brand'].apply(convert_to_string)
df['synt_brand'] = remove_punctuation_marks(df["synt_brand"])
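A possible alternative to converting the lists with `str()` and then stripping brackets, commas and quotes is to join the words directly. This is only a sketch: `brand_to_text` is a hypothetical helper, and it assumes each cell holds either a plain string or a list of words.

```python
def brand_to_text(brand):
    """Turn a brand stored as a string or a list of words into plain text."""
    if isinstance(brand, (list, tuple)):
        return " ".join(str(part) for part in brand)
    return str(brand).strip()
```

Unlike the regex version, this keeps any comma or apostrophe that is part of a name and removes trailing spaces; it could be used with `df['synt_brand'].apply(brand_to_text)`.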




### Last step

We already have all the data inside the dataframe; the only thing left to do is to save the content in a csv file.

In [37]:
mark = "hpw"+ year + month  + day + ".csv"
ruta = '/home/ion/Documentos/albertjimrod/hispaok/csv/'
df.to_csv(ruta + mark, index = True)
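One detail worth noting: because `year + month + day` is not zero-padded, a name such as `hpw2023111.csv` could mean January 11 or November 1. A minimal sketch of a zero-padded file name follows; `csv_path` is a hypothetical helper and the folder passed to it is an assumption.

```python
import datetime as dt
from pathlib import Path

def csv_path(folder, when=None, prefix="hpw"):
    """Build '<folder>/<prefix>YYYYMMDD.csv' with a zero-padded date."""
    when = when or dt.date.today()
    return Path(folder) / f"{prefix}{when:%Y%m%d}.csv"
```

The result can be passed straight to `df.to_csv(...)`, which accepts path objects as well as strings.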

In [38]:
df.head(20)

| Unnamed: 0 | urgent | buy | change | sell | price | gift | search | repair | parts | synt_brand | description | city | published | expire | date_scrapped | seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 200 | 0 | 0 | 0 | 0 | korg | korg 05r w módulo | Madrid | 2023/07/27 | 2024/06/14 | 2023/12/20 | 456 |
| 2 | 0 | 0 | 0 | 1 | 360 | 0 | 0 | 0 | 0 | waldorf | waldorf pulse rack | Barcelona | 2023/12/18 | 2024/06/15 | 2023/12/20 | 88 |
| 3 | 0 | 0 | 0 | 1 | 495 | 0 | 0 | 0 | 0 | roland | sampler s750 roland | Madrid | 2023/09/30 | 2024/06/17 | 2023/12/20 | 440 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | roland | roland rd 2000 | Madrid | 2023/11/19 | 2024/06/12 | 2023/12/20 | 286 |
| 5 | 0 | 0 | 0 | 1 | 340 | 0 | 0 | 0 | 0 | studiologic | studiologic numa compact 2 | Castellón | 2023/12/16 | 2024/06/13 | 2023/12/20 | 124 |
| 6 | 0 | 0 | 0 | 1 | 650 | 0 | 0 | 0 | 0 | kurzweil | kurzweil pc3 le6 | Madrid | 2023/11/05 | 2024/06/17 | 2023/12/20 | 252 |
| 7 | 0 | 0 | 0 | 1 | 425 | 0 | 0 | 0 | 0 | korg | korg drumlogue | Bizkaia | 2023/01/24 | 2024/06/16 | 2023/12/20 | 1530 |
| 8 | 0 | 0 | 0 | 1 | 22 | 0 | 0 | 0 | 0 | roland | sonidos roland jupiter x xm series y zenology ... | Pontevedra | 2022/07/15 | 2024/06/16 | 2023/12/20 | 1744 |
| 9 | 0 | 0 | 0 | 1 | 90 | 0 | 0 | 0 | 0 | doepfer | doepfer dual quantizer vintage edition | Madrid | 2023/11/29 | 2024/06/13 | 2023/12/20 | 82 |
| 10 | 0 | 0 | 0 | 1 | 75 | 0 | 0 | 0 | 0 | m-audio | controlador m-audio 5 octavas evolution mk461c | Madrid | 2023/02/15 | 2024/06/16 | 2023/12/20 | 629 |
