IDSAI_2024_lecture3_DEMO1 -------------------------------- 01000110.01001010 ---- revised: Aug2024_F.Jalalypour

Here, we will start scraping the Avanza webpage by reading the content of the file and then parsing it with Beautiful Soup.

In [None]:
#Importing BeautifulSoup
from bs4 import BeautifulSoup      #This line imports the BeautifulSoup class from the bs4 module.  

#Reading the HTML File
with open('avanza.html','r') as f: #This line opens the file named avanza.html in read mode ('r').
    html = f.read()
    
#Parsing the HTML    
soup = BeautifulSoup(html, 'html.parser') #This line creates a BeautifulSoup object named soup by parsing the HTML content stored in the html variable

In [None]:
soup

Let's break down the process of extracting and processing data from an HTML page, using stock names and numerical data as the example. Comparing the visible content of the Avanza webpage with its source, let us search the source for, say, AstraZeneca. We find that the names are contained in HTML elements that look like this:

```
    <td class="orderbookName">
        <a class="ellipsis" href="/aktier/om-aktien.html/5431/astrazeneca">
            <span class="flag small SE"></span>AstraZeneca
        </a>
    </td>

```
Unfortunately, stock names and numerical data are in separate tables, so we need to handle them separately; first collecting the names and then the numerical data.

We will collect all `td` elements that have the class `orderbookName`. The name of the stock is the textual content of the element. In this case, we can access it simply through the `.text` attribute. In a more complicated case, we might have to collect the individual text nodes with, e.g., `cell.find_all(string=True)` (written `text=True` in older versions of Beautiful Soup).

The text contains a lot of whitespace around it, so it makes sense to `strip` that whitespace away.
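
As a small illustration of what `.text`, `strip` and `find_all(string=True)` return, the sketch below parses a hand-written fragment modelled on the AstraZeneca cell above. The fragment and the variable names are assumptions made for this example; it does not read the real page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modelled on the cell shown above (illustrative only).
snippet = """
<td class="orderbookName">
    <a class="ellipsis" href="/aktier/om-aktien.html/5431/astrazeneca">
        <span class="flag small SE"></span>AstraZeneca
    </a>
</td>
"""

demo_cell = BeautifulSoup(snippet, 'html.parser').find('td', class_='orderbookName')

raw_text = demo_cell.text                   # text including surrounding newlines and spaces
clean_text = demo_cell.text.strip()         # leading and trailing whitespace removed
pieces = demo_cell.find_all(string=True)    # every text node inside the cell, one by one

print(repr(raw_text))
print(repr(clean_text))
print(pieces)
```

Printing with `repr` makes the newlines and spaces visible, which is why the stripped version is the one worth storing.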

In [None]:
# Initialize an Empty List
names = list()    #This line creates an empty list named "names"

#Find All Matching Elements #find_all method returns a list of all matching elements
for cell in soup.find_all('td',class_='orderbookName'): #This line searches through the parsed HTML document (soup) for all <td> elements that have the class orderbookName.
    names.append(cell.text.strip()) #Extract and Clean Text #text: textual content; strip: whitespace away
names

In [None]:
len(names) #Length of names

In a snapshot, AstraZeneca would have the following numbers:

|Senast |+/-%|1 år %|Börsvärde MSEK|P/E-tal|Direktavk. %|Ägare |Lista
|-------|----|------|--------------|-------|------------|------|-----
|1423,00|0   |-3,98 |2 202 059     |34,05  |2,13        |62 653|Large Cap Stockholm

Here *Senast* is the latest share price, *+/-%* the daily change in percent, *1 år %* the percentage change over the past year, *Börsvärde MSEK* the market value of the company in millions of kronor, *P/E-tal* the price-to-earnings ratio of the company, *Direktavk. %* (direktavkastning) the dividend yield, *Ägare* the number of owners, and *Lista* the list under which the stock is traded on the Stockholm stock exchange.

We can find these in the following element (a substantial amount of whitespace has been removed):
    
```
<tr class="row rowId11" id="11">
    <td class="">
        <span class="pushBox" data-aza-push="vm.pushData.latest['5431'].lastPrice" data-aza-push-fractions="2">1423,00</span>
    </td>
    <td class="neutral">
        <span data-ng-class="{'neutral': vm.commonService.isNeutral(vm.pushData.latest['5431'].changePercent), 'negative': vm.commonService.isNegative(vm.pushData.latest['5431'].changePercent), 'positive': vm.commonService.isPositive(vm.pushData.latest['5431'].changePercent)}" data-aza-push="vm.pushData.latest['5431'].changePercent" data-aza-push-fractions="2" class="neutral" style="">0</span>
    </td>
    <td class="negative">
        <span>-3,98</span>
    </td>
    <td class="">
        <span>2&nbsp;202&nbsp;059</span>
    </td>
    <td class="">
        <span>34,05</span>
    </td>
    <td class="">
        <span>2,13</span>
    </td>
    <td class="">
        <span>62&nbsp;653</span>
    </td>
    <td class="">
        <span>Large Cap Stockholm</span>
    </td>
</tr>

```

**Side Note:**

`<tr>` (Table Row) : This tag represents a single row in a table.
    
`<td>` (Table Data): This tag represents a single cell (or column) in a row.

```
<tr class="row">
  <td>Row 1 Column 1</td>
  <td>Row 1 Column 2</td>
</tr>
```
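
To make the row and cell structure concrete, here is a minimal sketch that parses a small hand-written table and collects the cell texts row by row. The table content and the `demo_` names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hand-written table for illustration; not taken from the Avanza page.
snippet = """
<table>
  <tr class="row"><td>Row 1 Column 1</td><td>Row 1 Column 2</td></tr>
  <tr class="row"><td>Row 2 Column 1</td><td>Row 2 Column 2</td></tr>
</table>
"""

demo_soup = BeautifulSoup(snippet, 'html.parser')

# One inner list per <tr>, one string per <td>.
demo_rows = [[td.text.strip() for td in tr.find_all('td')]
             for tr in demo_soup.find_all('tr', class_='row')]
print(demo_rows)
```

The result is a list of lists, which is the same shape we will build for the real table further down.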

So we will need to find `tr` elements that have class `row`. 

In [None]:
soup.find('tr',class_='row')

As you can see, the first match contains content we do not need: the name of a stock with a link to more details about it, as well as a list of buy and sell buttons with links for placing an order.

A generic search like this can return more elements than we want, because other elements on the page may have a similar structure. To find a specific piece of information more reliably, we can locate an element that contains a known value, for instance a row that contains the name "AstraZeneca" or its share price.

Once we have found the element containing that value, we can locate its parent `table` (all `tr` elements belong to a `table`) and then search within that table only.

To find the known value, we can match it with a regular expression. Regular expressions (`regex`) help when the text sits inside complex or inconsistently formatted HTML: they let us match the text content precisely and ignore the surrounding characters we do not need.
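
The idea can be tried out on a hand-written fragment first (the fragment and the `demo_` names are assumptions for illustration). When given a compiled pattern, Beautiful Soup tests each text node with the pattern's `search()` method, so the pattern only needs to describe the part of the text we know.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; the values mimic the row shown earlier.
snippet = """
<table>
  <tr class="row"><td><span>1423,00</span></td><td><span>-3,98</span></td></tr>
</table>
"""

demo_soup = BeautifulSoup(snippet, 'html.parser')

# re.escape guards against characters that are special in regular expressions.
hit = demo_soup.find(string=re.compile(re.escape('1423,00')))

print(repr(hit))          # the matching text node
print(hit.parent.name)    # the tag that directly contains it
```

Note that the match is a text node, not a tag, so its `.parent` is the innermost tag around the text.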

In [None]:
#find a specific string within an HTML document 
import re    #Imports the re module, Python's library for working with regular expressions.

# Define the pattern to find a specific string
pattern = r'.*1423,00.*' #matches any string containing 1423,00 with any characters before or after it.
element = soup.find(string=re.compile(pattern)) #Searches for text(string) in the HTML that matches the regular expression pattern provided

Now we can navigate up the HTML tree from a specific element and retrieve its parent elements:

In [None]:
table = element.parent.parent.parent.parent.parent  #to navigate up five levels from the element 
table.name                                          # and then check the name of the tag at that level.
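
Counting `.parent` steps by hand depends on the exact nesting of the page. As an alternative sketch (not the approach used in this demo), `find_parent('table')` walks upward until it reaches the first enclosing `table`. The fragment below is hand-written for illustration and the `demo_` names are assumptions.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment with the same nesting depth as assumed above.
snippet = """
<table id="demo">
  <tbody>
    <tr class="row"><td><span>1423,00</span></td></tr>
  </tbody>
</table>
"""

demo_soup = BeautifulSoup(snippet, 'html.parser')
demo_element = demo_soup.find(string=re.compile('1423,00'))

# Five manual steps: span -> td -> tr -> tbody -> table
by_steps = demo_element.parent.parent.parent.parent.parent

# Same element, without counting levels
by_search = demo_element.find_parent('table')

print(by_steps.name, by_search.name)
```

If the page had one more or one fewer wrapper element, the chained version would land on a different tag, while `find_parent` would still return the table.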

Yay, so this is the table that we are looking for. So now let's look for the children underneath :)

In [None]:
table.find('tr',class_='row')

This is the data we want. So let's extract its `td` elements one by one.

In [None]:
for cell in table.find('tr',class_='row').select('td'):
    print(cell.text.strip())

We will need to do this for all such rows, so we will use `find_all` and then construct the values.

In [None]:
values = list()
for row in table.find_all('tr',class_='row'):
    values.append([cell.text.strip() for cell in row.select('td')])
values

In [None]:
len(values)

**Data Processing and Formatting in Lists**

An important detail to note is the presence of unusual `\xa0` characters in some of the values. These characters are non-breaking spaces. For instance, the Python string `'58\xa0817'` corresponds to `58&nbsp;817` in the HTML source and is displayed as 58 817. The `&nbsp;` entity represents a non-breaking space, which looks like a regular space but prevents a line break at that position, so the text stays on the same line. It is often used to separate groups of digits or to attach units to numbers.

In addition to these non-breaking spaces, the numbers use a comma as the decimal separator (for example `1423,00`), which Python's `float` does not accept. Both need to be handled before the strings can be converted to numbers.
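
For a single value, the clean-up can be sketched as follows. The helper name `to_number` is hypothetical and only used here; the input strings are taken from the AstraZeneca snapshot above.

```python
def to_number(text):
    """Convert a Swedish-formatted numeric string to int or float."""
    # Drop non-breaking spaces and switch to a decimal point.
    cleaned = text.replace('\xa0', '').replace(',', '.')
    # A decimal point after cleaning means the value was fractional.
    return float(cleaned) if '.' in cleaned else int(cleaned)

print(to_number('2\xa0202\xa0059'))
print(to_number('1423,00'))
print(to_number('-3,98'))
```

This helper picks the type per value, so a plain `'0'` becomes an integer. The demo converts by column position instead, which keeps every value in a column the same type.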

After processing, each row in `values` is a list of floats, integers, and one string, formatted consistently and ready for further analysis or visualization.

In [None]:
#Handling Special Characters and Comma Replacements
values = [list(map(lambda s: s.replace('\xa0','').replace(',','.'),row)) for row in values]

#Converting String Values to Floats and Integers
values = [list(map(float,row[:3])) + [int(row[3])] + \
          list(map(float,row[4:6])) + [int(row[6])] + [row[7]] for row in values]
values

Now we can construct a dataframe.

In [None]:
import pandas as pd
data = list()
for (name,val) in zip(names,values):
    row = { 'Name' : name,
            'Latest' : val[0],
            'Change %' : val[1],
            '1 year %' : val[2],
            'Market value MSEK' : val[3],
            'P/E' : val[4],
            'Dividend yield %' : val[5],
            'Owners' : val[6],
            'List' : val[7]
          }
    data.append(row)
data = pd.DataFrame(data)
data
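
As a more compact alternative sketch, the same kind of frame can be built by passing the rows and the column names directly to `pd.DataFrame`. The stand-in data below contains only the AstraZeneca snapshot, and the `demo_` names are assumptions for illustration.

```python
import pandas as pd

# Stand-in data in the same shape as `names` and `values` (one snapshot row).
demo_names = ['AstraZeneca']
demo_values = [[1423.0, 0.0, -3.98, 2202059, 34.05, 2.13, 62653, 'Large Cap Stockholm']]

columns = ['Latest', 'Change %', '1 year %', 'Market value MSEK',
           'P/E', 'Dividend yield %', 'Owners', 'List']

demo_frame = pd.DataFrame(demo_values, columns=columns)
demo_frame.insert(0, 'Name', demo_names)   # put the names in the first column
print(demo_frame)
```

Both approaches rely on `names` and `values` being in the same order and of the same length, which is why checking `len(names)` and `len(values)` earlier is worthwhile.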

In [None]:
# Display the  AstraZeneca row (indexing starts from 0)
data.iloc[11]

In [None]:
# Display the First row  (indexing starts from 0)
data.iloc[0]

**Saving Processed Data to CSV**

To export the data, you can use the `to_csv` method from pandas.

In [None]:
data.to_csv('avanza.csv',index=False) #save the DataFrame data to a CSV file named avanza.csv without including the row indices.
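
To check that an export can be read back, a round trip can be sketched like this. It writes to an in-memory buffer rather than a file, and the small stand-in frame (`demo`) is an assumption for illustration with values from the AstraZeneca snapshot.

```python
import io
import pandas as pd

# Small stand-in frame; the values are copied from the AstraZeneca snapshot above.
demo = pd.DataFrame([{'Name': 'AstraZeneca', 'Latest': 1423.0, 'Owners': 62653}])

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # write CSV text into memory instead of a file
buffer.seek(0)
restored = pd.read_csv(buffer)     # read it back into a new DataFrame

print(restored)
```

With `index=False` the row numbers are not written, so the restored frame has exactly the original columns.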

Thank you for your attention :) 