# Regular Expressions in Finance
### Using London Stock Exchange data

Using a raw text file downloaded from the official London Stock Exchange website, I am going to use regular expressions to answer the following questions:

**1) What fraction of the firms listed on the LSE are foreign?**

**2) What is the average market cap for domestic (UK) companies and what is the average market cap for foreign (non-UK) companies?**

**Important Note:** This text file is a snapshot of the LSE from the end of February 2021.

In [15]:
# First step: import the regular expression package and read the text file
import re
def Text(filename):
    # Read the whole file into one string; 'with' closes the file automatically
    with open(filename, 'r') as f:
        return f.read()
file = Text('LondonStockExchange_Listing.txt')

### How is the data structured?

Now that we have the data downloaded, the first step is to figure out its structure. It's difficult to read the raw text file, but it is clear there are nine columns, as follows:
1) Admission Date - when the stock was first listed
2) Company Name
3) Industry Classification Benchmark (ICB) Industry
4) ICB Super-Sector
5) Country of Incorporation
6) World Region
7) Market
8) International Issuer (Yes = Foreign, No = Domestic)
9) Company Market Cap (in millions of pounds)
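
One way to check this nine-column layout is to split a single row on tabs. The sketch below uses the first data row from the raw file as a sample; `sample_row`, `columns`, and `fields` are illustrative names that are not part of the original notebook.

```python
# Sample: the first data row of the raw file (3I GROUP PLC)
sample_row = "\t18/07/1994\t3I GROUP PLC\tFinancials\tFinancial Services \tUnited Kingdom\tEurope\tMAIN MARKET \tNo\t10,772.87"

columns = [
    "Admission Date", "Company Name", "ICB Industry", "ICB Super-Sector",
    "Country of Incorporation", "World Region", "Market",
    "International Issuer", "Company Market Cap (£m)",
]

# Each row starts with a tab, so the first split element is empty; drop it
fields = sample_row.split("\t")[1:]

# Pair each column name with its value (some values carry trailing spaces)
for name, value in zip(columns, fields):
    print(f"{name}: {value.strip()}")
```

Splitting like this is only meant for inspecting the structure; the analysis itself still relies on regular expressions.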

The International Issuer column makes answering our first question relatively straightforward: the number of "Yes" entries divided by the total number of companies gives the fraction of foreign companies, and the number of "No" entries gives the fraction of domestic companies in the same way. We just have to make sure that our regular expression is accurate and doesn't pick up anything that throws our analysis off.

### In the raw text file printed below, we can see the data is filled with "\t" characters, or tabs. Since every Yes or No is surrounded by these tabs, we can easily pull them out using regular expressions and then calculate their respective percentages.

In [13]:
file

"\t\t\t\t\t\t\t\t\t\n\tCompanies on London Stock Exchange\t\t\t\t\t\t\t\t\n\tAs at 28 February 2021\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\tAdmission Date\tCompany Name\tICB Industry\tICB Super-Sector\tCountry of Incorporation\tWorld Region\tMarket\tInternational Issuer\tCompany Market Cap (£m)\n\t18/07/1994\t3I GROUP PLC\tFinancials\tFinancial Services \tUnited Kingdom\tEurope\tMAIN MARKET \tNo\t10,772.87\n\t13/03/2007\t3I INFRASTRUCTURE PLC\tFinancials\tFinancial Services \tJersey\tEurope\tMAIN MARKET \tNo\t2,594.07\n\t13/03/1953\t4IMPRINT GROUP PLC\tConsumer Discretionary\tMedia\tUnited Kingdom\tEurope\tMAIN MARKET \tNo\t685.29\n\t04/10/2005\t888 HOLDINGS PLC\tConsumer Discretionary\tTravel and Leisure\tGibraltar\tEurope\tMAIN MARKET \tYes\t1,095.98\n\t26/06/2014\tAA PLC\tConsumer Discretionary\tAutomobiles and Parts\tUnited Kingdom\tEurope\tMAIN MARKET \tNo\t217.93\n\t20/12/2005\tABERDEEN ASIAN INCOME FUND LTD\tFinancials\tFinancial Services \tJersey\tEurope\tMAI

In [20]:
# Creating the variable "foreign" to see how many foreign companies are listed
# Using 'len' to count because findall puts all the RE matches into a list
foreign = re.findall('\tYes\t', file)
len(foreign)

124

In [21]:
# Same thing for domestic companies
domestic = re.findall('\tNo\t', file)
len(domestic)

892
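
To illustrate why the surrounding tabs matter, here is a minimal sketch on a made-up row. The row is hypothetical (it does not come from the LSE file) and is built so that another column, the country "Norway", happens to contain the letters "No".

```python
import re

# Hypothetical row (not from the LSE file): the country "Norway" contains "No"
sample = "\t01/01/2000\tEXAMPLE CO\tEnergy\tEnergy\tNorway\tEurope\tMAIN MARKET \tYes\t100.00\n"

loose = re.findall(r"No", sample)        # also matches the "No" inside "Norway"
strict = re.findall(r"\tNo\t", sample)   # only matches a whole tab-delimited field

print(len(loose), len(strict))
```

The loose pattern would count this foreign company as domestic, while the tab-delimited pattern correctly finds no "No" field in the row.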

### Now that we have these two numbers, we can easily answer the first question: what is the breakdown of foreign vs domestic companies listed on the LSE?

In [27]:
total = len(foreign) + len(domestic)
foreign_pct = len(foreign) / total
domestic_pct = len(domestic) / total
print('Of the {} companies listed on the London Stock Exchange as of February 28 2021, {:5.2f}% of them are foreign and {:5.2f}% of them are domestic'.format(total, foreign_pct*100, domestic_pct*100))

Of the 1016 companies listed on the London Stock Exchange as of February 28 2021, 12.20% of them are foreign and 87.80% of them are domestic


## Calculating the average market cap for both foreign and domestic companies is a bit tougher. Our regular expression has to be more detailed now because we need to know not only whether the firm is domestic, but also its market cap.

Looking at the structure of the data again, you can see that market cap is listed directly after the International Issuer column. So while our regular expression will be more complex, all we need to do is capture the numbers after the Yes or No.

In [32]:
# Foreign
foreign_mktcap = re.findall('(Yes)\t([0-9,.]+)', file)
print(foreign_mktcap[0:3], len(foreign_mktcap))

[('Yes', '1,095.98'), ('Yes', '269.96'), ('Yes', '50.41')] 124


In [33]:
# Domestic
domestic_mktcap = re.findall('(No)\t([0-9,.]+)',file)
print(domestic_mktcap[0:3], len(domestic_mktcap))

[('No', '10,772.87'), ('No', '2,594.07'), ('No', '685.29')] 892


Now we have two lists, foreign and domestic, in which each tuple represents a company and stores its market cap alongside the Yes/No classification. The foreign list has 124 tuples and the domestic list has 892, which matches the counts from our earlier regular expressions.
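
As an aside, the two searches could also be combined into one pass by using alternation in the first capture group. This is only a sketch of an alternative, not the approach used in this notebook; it runs on two rows copied from the raw file, and `sample` and `caps_by_flag` are illustrative names.

```python
import re
from collections import defaultdict

# Two rows copied from the raw file, used as a stand-in for the full text
sample = (
    "\t18/07/1994\t3I GROUP PLC\tFinancials\tFinancial Services \tUnited Kingdom\tEurope\tMAIN MARKET \tNo\t10,772.87\n"
    "\t04/10/2005\t888 HOLDINGS PLC\tConsumer Discretionary\tTravel and Leisure\tGibraltar\tEurope\tMAIN MARKET \tYes\t1,095.98\n"
)

# Group the captured market caps by their Yes/No flag
caps_by_flag = defaultdict(list)
for flag, cap in re.findall(r"\t(Yes|No)\t([0-9,.]+)", sample):
    caps_by_flag[flag].append(cap)

print(dict(caps_by_flag))
```

Keeping the leading tab in the pattern guards against matching "Yes" or "No" in the middle of another field.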

The next step is to find the average market cap for foreign and domestic companies. With for loops, this is a simple task using the lists we already created.

In [38]:
# Foreign
foreign_list = [] # Empty list to store only market cap values
for x in foreign_mktcap: # For every tuple in the foreign list:
    foreign_list.append(x[1]) # Add x[1] (the mkt cap) to the new list
foreign_list[0:3]

['1,095.98', '269.96', '50.41']

## Important - Cleaning up the commas

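Because the market caps come out of the raw text file as strings with commas as thousands separators (for example '1,095.98'), and `float()` cannot parse a string that contains commas, the commas have to be removed first.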
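
A quick sketch of the problem, using the first foreign market cap from the list above (`raw` and `cleaned` are illustrative names):

```python
raw = '1,095.98'  # first foreign market cap, still a string

# float() raises ValueError when the string contains a comma
try:
    float(raw)
except ValueError as err:
    print('float() rejects the comma:', err)

# Stripping the comma first makes the conversion work
cleaned = float(raw.replace(',', ''))
print(cleaned)
```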

In [39]:
foreign_sum = 0 # new variable to sum all of the list's content
for x in foreign_list:
    x = float(x.replace(',','')) # replace commas with nothing
    foreign_sum += x
foreign_sum

1175323.53

In [40]:
# Now that we have the total market cap of all the foreign companies, we can easily find the average
foreign_avg = foreign_sum / len(foreign_list)
foreign_avg

9478.41556451613

In [41]:
# Domestic 
domestic_list = []
for x in domestic_mktcap:
    domestic_list.append(x[1])
domestic_sum = 0
for x in domestic_list:
    x = float(x.replace(',',''))
    domestic_sum += x
domestic_avg = domestic_sum / len(domestic_list)
domestic_avg

2644.351827354261
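
Since the foreign and domestic calculations repeat the same steps, they could be folded into one small helper. The function below is a hypothetical sketch (`average_market_cap` is not part of the original notebook), shown on the first three domestic tuples rather than the full list.

```python
def average_market_cap(pairs):
    """Average the market caps in a list of (flag, cap_string) tuples."""
    # Strip the thousands separators and convert each cap to a float
    caps = [float(cap.replace(',', '')) for _, cap in pairs]
    return sum(caps) / len(caps)

# First three domestic tuples from the findall output above
sample_pairs = [('No', '10,772.87'), ('No', '2,594.07'), ('No', '685.29')]
sample_avg = average_market_cap(sample_pairs)
print(round(sample_avg, 2))
```

Calling it as `average_market_cap(foreign_mktcap)` and `average_market_cap(domestic_mktcap)` should reproduce the two averages computed with the loops.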

Now we have the averages, but we have to remember that the data set reports market cap in millions of pounds.

In [42]:
print('The average market cap for domestic firms listed on the LSE as of 2/28/2021 is {:5.2f} million pounds'.format(domestic_avg))

The average market cap for domestic firms listed on the LSE as of 2/28/2021 is 2644.35 million pounds


In [43]:
print('The average market cap for foreign firms listed on the LSE as of 2/28/2021 is {:5.2f} million pounds'.format(foreign_avg))

The average market cap for foreign firms listed on the LSE as of 2/28/2021 is 9478.42 million pounds
