# Case Study on Business Analytics and Big Data

Date: 07/February/2018
    
Author: Prof. Manoel Gadi - http://mfalonso.pythonanywhere.com/ - https://www.linkedin.com/in/manoel-gadi-97821213/


----

### How to use this case study: what should I do as a student / reader to get the most out of the case?

1. Read the case. When you reach the second part, called "The Case", mimic Rita and repeat all her steps.
2. Next, the reader should answer the questions that are at the end of the case study.


### Glossary:

* SMEs: Small and Medium Enterprises - the exact rule used to classify companies as micro enterprises, SMEs or large companies usually varies from country to country, but in general it is based on turnover (sales), total assets and number of employees.
* Rating Report: A Financial Rating Report is a document that contains a rating analysis of a company.
* Rating grade: An opinion of the financial community (banks, investors or other agents) about the credit quality or insolvency risk of a company. The rating grade can be expressed on either a numerical or a letter scale.
* Triple A: The best possible rating on the rating scale.
* BBB-: The lowest grade of the investment grade group.
* BB+: The highest grade of the speculative grade group.
* Grade D: Default, non-payment (Chapter 11 companies are often considered grade D by definition).
* Investment Grade: Companies or countries with solvency quality considered high or with low insolvency risk.
* High Yield (Speculative grade): Companies or countries with solvency quality considered low or with a high risk of insolvency.
* Market beta: Measures the degree of variability of a share's return with respect to the average return of the market. In general, the country's reference index is used as the benchmark; in the Spanish case, the IBEX 35.
* Listed Companies: Companies with shares listed for trading on the stock exchange.
* Ibex 35: Group of companies with the highest market capitalization among those listed on the Stock Exchange Interconnection System, composed of the four Spanish Stock Exchanges (Madrid, Barcelona, Valencia and Bilbao).
* Web scraping: technique to extract information from web pages in an automated way.
* Statsmodels: A Python module that provides classes and functions for the estimation of statistical models.
* Ordinary least squares (OLS): It is a method to estimate the unknown parameters in a linear regression model.
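
As a rough illustration of the Market beta entry above: beta is commonly estimated as the covariance between a share's returns and the benchmark's returns, divided by the variance of the benchmark's returns, which is also the slope of an OLS regression of the share's returns on the benchmark's. The sketch below uses made-up return series (not real market data) and plain Python; the names `stock_returns`, `market_returns` and `beta` are hypothetical.

```python
# Hypothetical periodic returns (illustrative numbers, not real market data).
market_returns = [0.01, -0.02, 0.015, 0.03, -0.01]
stock_returns = [0.02, -0.04, 0.03, 0.06, -0.02]  # exactly twice the market here


def mean(values):
    return sum(values) / len(values)


def beta(stock, market):
    """Covariance(stock, market) divided by variance(market)."""
    mean_stock, mean_market = mean(stock), mean(market)
    cov = sum((s - mean_stock) * (m - mean_market) for s, m in zip(stock, market))
    var = sum((m - mean_market) ** 2 for m in market)
    return cov / var  # the 1 / (n - 1) factors of both estimates cancel out


print(round(beta(stock_returns, market_returns), 4))
```

Because the hypothetical share always moves twice as much as the benchmark, its beta comes out at 2; a share that simply tracked the benchmark would have a beta of 1.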

### Reference books:
* Measuring and Managing Credit Risk by Arnaud de Servigny, Olivier Renault
* Credit Scoring for Risk Managers: The Handbook for Lenders by Elizabeth Mays 
* Managing a Consumer Lending Business by David Lawrence and Arlene Solomon
* Python for Finance - Analyze Big Financial Data by Yves Hilpisch


-----

## Developing a Rating Model for Companies using Real Scraped Yahoo Data and Machine Learning in Python

## Company Rating Fundamentals

The analysis of a company's activity and solvency is one of the most classic applications of statistical and analytical methods. When deciding whether to grant a loan, or to expand or cut financing, financial institutions no longer rely on subjective criteria; instead they use scientific tools, one of which is called Rating. The Rating can be understood as a set of algorithms applied at the time of the decision. To build such a rating, institutions look, among other indicators, at quantitative aspects, mainly financial ratios combined in a certain proportion. This can be compared to a cooking recipe in which some ingredients are used in a certain proportion and others are not used at all. The challenge for Analytics is to find which ingredients to use, in what proportions, and how to combine them. To do so, statistical methods and / or Machine Learning techniques are used.

### What is a rating report and what is it used for?


"With this tool, it is possible to actively manage your balance sheet and gain access to more financing sources on better terms."


Put in a very direct and probably very boring way, a Financial Rating Report is a document that contains an analysis and an opinion of the financial community (banks, investors or other agents) about the credit quality or insolvency risk of a company. The opinion is summarized in a credit risk score or rating on a standardized numeric or letter scale. The credit quality, or credit risk, is the result of an analysis of quantitative and qualitative factors that affect not only the company, but also its sector and its country.

However, the famous economist Michael Porter, author of the book "Competitive Advantage", considered the bible of business thinkers, might not have achieved the same impact, a real Eureka moment among business experts, if he had explained the Value Chain merely as a diagram that organizes the interconnected activities in the company. Or maybe he would.

Michael Porter presents the idea of the Value Chain as the map of the gold mine to understand how a certain company puts all its machinery to transform money into more money. A true X-ray that reveals a diagram of interconnected activities of the bone structure of the company, where one sees how it transforms money into raw material, then into final product, going on sale and finally resulting in profit. The Value Chain is this X-Ray that allows us to understand the operational functioning of a company.

External Reference: Michael Porter - Value Chain: https://www.youtube.com/watch?v=9e2KtGvI1Pw

![title](Porter_Value_Chain.png)

Following the medical metaphor, we can say that the Rating Report is the company's blood test. A blood test measures, for example, the volume of platelets, lymphocytes, monocytes and other items whose names sometimes sound like jargon without much meaning. However, we are able to understand that we are healthy if the measured value lies within the lower and upper reference limits shown in the report. These limits are, in general, benchmarks from healthy people with characteristics similar to ours, such as the same gender and a similar age.

Rating Reports have a structure very similar to a blood test. Here the blood is the company's money, which appears in the annual accounts mainly in the form of cash, debt and equity. Therefore, we compare ratios such as the leverage ratio, liquidity ratio, profitability ratio and capital structure against reference values of similar companies, often of the same sector and size as the company being analyzed. We call this first part of the report the Quantitative Rating. But laboratory results by themselves are not everything we need. It is always recommended to visit the doctor, who interprets the test results and asks some questions to reach a global evaluation of the person. In the Rating Report, this visit takes the form of an interview and a qualitative questionnaire. Although very structured and regulated, this interview tries to map issues as diverse as the evolution of demand and the market, the history of the partners and the management team, and the strategy and business plan as a whole.

Today it is impossible to imagine a company that does not know its operational X-ray through the Value Chain of its business and its sector. In the same way, it should be unimaginable for a company not to know the "blood test" of its money: a company without the right tools to understand its insolvency risk does not know how the bank perceives it and, consequently, cannot compare its creditworthiness with that of its competitors.

Below is an example of the automatic part of a Rating Report, which grades the following ratios from AAA (best) to D (worst): Debt Coverage, Interest Coverage, Current Ratio, Quick Ratio, Cash Flow Liquidity, Gross Financial Debt / Total Assets, Equity / Total Assets, Asset Turnover, ROA, EBITDA Margin, Operating Profit Margin, Variation of EBITDA:

![title](RatingSample.gif)

Ultimately, the Rating Report is the basis for accessing funding sources and determines their terms and conditions. Once in possession of this tool, a company can actively manage its balance sheet to improve its credit rating and, therefore, gain access to more sources of financing, better conditions and, finally, a bigger bottom line (profits).

### Who are the main players of the Rating industry?

The main rating agencies in the world are:
* Moody’s - https://www.moodys.com/
* Standard & Poor's (S&P) - https://www.standardandpoors.com
* Fitch - https://www.fitchratings.com


In Spain:
* Axesor - https://www.axesor.es 


Moody's and S&P can be considered the most influential agencies because of their extensive worldwide coverage.

Rating agencies in general get their income in two ways:
1. Fees collected from the evaluated companies / counterparties. These fees are usually charged at the time of evaluation, plus annual payments for the renewal of the rating. Rating prices depend heavily on the size of the company and the market in which it operates. As a reference, prices range approximately from 30 thousand to 200 thousand euros.
2. Through the sale of research projects, consulting services, software, or proprietary information.


### What does the Rating grade scale look like?

![title](RatingGrade.png)

### The context of the Rating of Spanish Companies

According to the rating report of Bravo Capital, a financing company, the ratings of Spanish companies follow a Normal distribution:

![title](BravoAllCompanies.png)

45% of Spanish companies are between BBB- (investment grade limit) and BB+.

The main drawback for companies with high ratings is the low return for shareholders that comes with such a rating. However, an AAA company has the potential to take on much more debt than it has, for example to grow organically in new markets or through acquisitions. At the opposite end of the bell curve, the drawback is the lack of liquidity and solvency. This situation results in the bank demanding the repayment of debts, which leads many companies into insolvency proceedings or liquidation in order to repay those debts.

### Ratings of publicly listed companies

Comparison between IBEX 35 companies and companies in the continuous market. 

![title](PublicListedSpain.png)

The IBEX 35 companies have a better distribution than the Spanish average, while the non-IBEX 35 companies have a worse one.

### Is size an important variable to obtain a better rating?

The Spanish business environment is composed mainly of SMEs (Small and Medium Enterprises). Since larger companies tend to get a better Rating, Spanish companies are penalized by this factor.

But does it make sense for a bigger company to get a better Rating just for being bigger? It actually does. This is known as the portfolio effect: a large company is less likely to lose 20% of its clients in a given period of time than a small company is during the same period. Therefore, statistically speaking, time plays a key role in the loss of customers.
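
The portfolio effect can be made concrete with a small probability sketch. Assume, purely for illustration, that every client leaves independently with the same probability during the period (10% here, a made-up figure). The number of lost clients is then binomial, and the chance of losing at least 20% of the client base shrinks quickly as the number of clients grows. The function name `prob_losing_share` and its parameters are hypothetical.

```python
from math import comb


def prob_losing_share(n_clients, churn_probability, share_percent=20):
    """P(at least share_percent % of n_clients leave), clients leaving independently."""
    # Smallest number of lost clients that reaches the share (integer ceiling).
    threshold = (n_clients * share_percent + 99) // 100
    return sum(
        comb(n_clients, k)
        * churn_probability ** k
        * (1 - churn_probability) ** (n_clients - k)
        for k in range(threshold, n_clients + 1)
    )


# Hypothetical small company (10 clients) versus large company (1000 clients).
for n in (10, 1000):
    print(n, prob_losing_share(n, churn_probability=0.10))
```

Under these assumptions the small company faces a sizeable probability of losing a fifth of its clients, while for the large company the same event is extremely unlikely.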


Below is a graph showing the Rating average by range of sales (in millions of Euros €)

![title](RatingBySize.png)

### Is the risk different depending on the sector?

As one can see below, the Rating by sector of Spanish companies shows that the average Rating of Construction is much lower than that of the Public Health sector. This makes sense, as the level of debt and the perceived risk of the two industries are different.

![title](RatingPorSector.png)

## The Case

Claudia Tarragona, CEO of the company InfoEmpresarial Spain, realized in 2014 that there were very few companies offering Rating services in Spain. Given the banking crisis that was taking place in the country, Claudia observed that the Rating tool could be key to reducing the balance-sheet risk of Spanish SMEs and, consequently, to bringing the whole Spanish market to a more comfortable position in terms of systemic risk.

However, Claudia had a big problem. She was not an expert in Rating and believed that no one in her company was. One morning, Claudia invited José Campos, the company's financial director, to grab a coffee and asked him what he knew about Rating. His response was direct: like Claudia, José was not knowledgeable about the topic. However, as a financial director, he knew capital and debt models perfectly, such as the Capital Asset Pricing Model and many other valuation models. Additionally, he knew that the company's bank used internal models to evaluate them, but he did not know how to dig deeper into the matter either. At some point in the conversation he had a great idea: perhaps they could talk to Nakamura (an employee at the Takashi partners fund), whom he already knew and who was, in addition, an expert investment analyst in Rating.

The next morning, Nakamura showed up at the InfoEmpresarial office, invited by his friend José Campos. Without knowing exactly what the reason for his visit was, Nakamura began to complain about the low effectiveness of the expansionary economic policy. According to Nakamura, the investment banking market was in an extremely difficult position. The crisis in which companies were immersed had worsened their rating scores, increased investment risk and, therefore, decreased margins. Also, as if this were not hard enough, interest rates at historic lows squeezed profit margins even further. For both the Japanese investment analyst and the financial director, some aspects of the current scenario were very bizarre: from their point of view, the money flowing from the central bank did not really seem to be reaching companies. In short, the banking industry also had obligations to fulfill, among them the new capital requirements imposed by Basel, under which lending money to companies in a bad financial situation was not an idea banks were keen on.

Having reached this point, Claudia (CEO) commented that everything the Takashi fund analyst was telling them was very interesting. However, the reason he had been called in was a proposal. Claudia wanted to offer him a job building a rating system for InfoEmpresarial, with the main goal of helping Spanish companies improve their credit ratings. After hearing the proposal, Nakamura very politely declined her offer. He claimed that although he was a heavy user of the Rating tool, his knowledge was limited to the use of Porter's 5 forces, some profit, activity and debt ratios and, finally, the Rating grade to guide his investment decisions. Put simply, his job was to calculate the valuation of companies. He did that by using EBITDA factors, adjusting the valuation for the predicted growth or decline, adjusting it again using Porter's 5 forces, and finally adjusting it once more depending on how risky the business was, with the riskiness coming from the Rating (above BBB- good, below BB+ risky). The investment analyst reiterated that he was not the right person for that job and that, unfortunately, he did not know anyone in the international agencies or in a bank who could help them.

A few moments later Nakamura reconsidered; he remembered that he did know a person who could help them. It was a headhunter named Fred Asterix, who specialized in searching for managerial candidates in the Risk industry and who, according to his understanding, had worked for Rating agencies on some occasions. So Nakamura, after recommending Fred, gave his contact information to Claudia and José, who greatly appreciated it, since they would need his services to conduct an efficient search for a candidate. Finally, both the CEO and the CFO thanked Nakamura one more time, as they were sure they had found the right contact to help them hire this "extraterrestrial" professional. A couple of weeks later, after some emails and telephone conversations in which Claudia conveyed all her needs to the headhunter, Fred provided his first candidate.

The first candidate was Joaquín Quintana, a true star of the corporate banking sector, with more than 15 years of experience as an Enterprise Risk Analyst, who currently held the position of Enterprise Risk Director at the VVBA bank.

Used to dealing with the enterprise world, Joaquín quickly began to explain to Claudia the methods he used when evaluating a company. Furthermore, since he was a very practical person and also a professor at a well-known business school, he quickly left the subjective side of the conversation behind and invited Claudia to do some real calculations with IBEX 35 companies.

He kindly asked Claudia to go to https://finance.yahoo.com, look up Telefonica (the ticker is TEF.MC) and click on Financials. Once there, they saw Telefonica's profit and loss account. From this financial statement they could work out that, as of December 31, 2016, the Net Profit Margin had been 4.47%. This result is obtained by dividing Net Income by Total Revenue.



    

$\text{Net Profit Margin} = \frac{\text{Net Income}}{\text{Total Revenue}} = \frac{2{,}369{,}000}{52{,}903{,}000} \approx 4.47\%$

Claudia found it very interesting. For her, as for any professional linked to the financial world, Net Profit Margin is a very important ratio for understanding the performance of a company.

However, Joaquín stressed that this ratio only used variables from the income statement and therefore biased the analysis towards a single period, in this case a single year. To avoid this bias, Joaquín suggested that they also look at balance sheet ratios, which they did by clicking on the Balance Sheet tab. In this way, they calculated one of the possible debt ratios, Total Debt / Total Assets. For the numerator they needed to add the short-term debt and the long-term debt shown in the balance sheet. Short-term debt appeared as [Short / Current Long Term Debt] and long-term debt as [Long Term Debt].


$\frac{\text{Total Debt}}{\text{Total Assets}} = \frac{\text{[Short / Current Long Term Debt]} + \text{[Long Term Debt]}}{\text{Total Assets}} = \frac{60{,}361{,}000 + 45{,}612{,}000}{123{,}641{,}000} = 85.71\%$
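
The two manual calculations can be reproduced in a few lines of Python. This is only a sketch using the figures quoted in the case; the variable names are hypothetical. Note that the exact margin is about 4.478%, which the case reports truncated as 4.47% (rounded to two decimals it would be 4.48%).

```python
# Figures quoted in the case for Telefonica (TEF.MC) as of December 31, 2016.
net_income = 2_369_000
total_revenue = 52_903_000
short_term_debt = 60_361_000  # [Short / Current Long Term Debt]
long_term_debt = 45_612_000   # [Long Term Debt]
total_assets = 123_641_000

net_profit_margin = net_income / total_revenue
debt_ratio = (short_term_debt + long_term_debt) / total_assets

print(f"Net Profit Margin: {net_profit_margin:.3%}")
print(f"Total Debt / Total Assets: {debt_ratio:.2%}")
```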

After observing the result of the debt ratio, Claudia thought that Telefonica was quite indebted.

Joaquín replied that this information alone did not allow him to say much about the company's financial status. "Here is where my technical limitation appears", commented the risk analyst. To analyze this company and know whether a Net Profit Margin of 4.47% is reasonable, or whether Total Debt / Total Assets = 85.71% indicates a highly indebted company, one would need to take the whole industry into account, that is, compare Telefonica with other companies in the same economic sector. He also pointed out to Claudia the enormous amount of time it would take to calculate those ratios manually, even for no more than 100 companies. At this point in the conversation, Joaquín honestly told Claudia that he was not the professional she was looking for. To carry out the sector analysis and, finally, to build a proprietary Rating model, what she needed was a Model Analyst or, even better, a kind of professional that was just emerging at that very moment. Finally, Joaquín asked Claudia: "Have you ever heard about Data Scientists or Big Data professionals?"

Since the CEO answered that she had not, Joaquín replied that he knew a Big Data course professor named Sasha Popov who was probably the person she was looking for. Joaquín went on to note that "our" Big Data professor had worked for many years in the areas of Models, Methodology and Analytics in the banking sector. He explained that, to understand how a data scientist could help her, she only had to think about what they had just done: taking the Yahoo information manually to calculate the ratios. Finally, he stressed that Sasha could build a program called a Web Scraper that would be able, in a matter of minutes, to collect all the data from the income statements and balance sheets of hundreds of companies. With this tool he could compare Telefonica with numerous companies in the sector. He also mentioned that Sasha would be able to build the model not with just two ratios but with hundreds of them.

On top of that, Sasha would use Machine Learning methods and Model Development techniques to discover which ratios are the most relevant. He would finally build a Rating model giving a weight to each of these ratios. "Do not be scared if he starts calling the ratios 'variables'," Joaquín said.
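
The idea Joaquín describes, comparing each ratio against sector peers and then combining the ratios with weights, can be sketched with pandas as follows. Everything here is an assumption made for illustration: the companies and their ratios are made up, the column names are hypothetical, and the weights are chosen by hand, whereas a real model would estimate them from data.

```python
import pandas as pd

# Made-up ratios for hypothetical companies (illustrative only, not real data).
ratios = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "sector": ["Telecom"] * 3 + ["Retail"] * 3,
    "net_profit_margin": [0.045, 0.080, 0.020, 0.030, 0.010, 0.060],  # higher is better
    "debt_to_assets": [0.86, 0.55, 0.70, 0.40, 0.65, 0.50],           # lower is better
})

# Percentile rank inside each sector, oriented so that 1.0 is always the best.
ratios["margin_rank"] = ratios.groupby("sector")["net_profit_margin"].rank(pct=True)
ratios["debt_rank"] = ratios.groupby("sector")["debt_to_assets"].rank(
    pct=True, ascending=False
)

# Hypothetical hand-picked weights; a real model would estimate them from data.
weights = {"margin_rank": 0.6, "debt_rank": 0.4}
ratios["score"] = sum(ratios[column] * weight for column, weight in weights.items())

print(ratios.sort_values("score", ascending=False))
```

Ranking within the sector is what makes a figure such as an 86% debt ratio interpretable: the same value can be ordinary in one industry and extreme in another.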

"Well, I'll call Professor Sasha today," Claudia answered, and concluded: "It has never been so difficult to hire a profile. These people from Big Data need to sell themselves better, because I had heard about them, but I thought they were computer experts, or people who only knew how to deal with 'big databases', like our IT team."

So that is what Claudia did, and after two interviews in which Claudia became very impressed with Sasha, and Sasha became very excited with the project, the agreement was imminent. Sasha's only requirement to sign the contract was that he wanted to also hire a student of his. Claudia had no problem accepting his request. Thus, on March 1, 2014, Sasha and Rita (student of Sasha) began their job challenge in the newly created Rating department.

When the two of them arrived on their first day of work, each already had a dual-core Windows laptop with 16 gigabytes of RAM. It was the configuration that Sasha had requested before starting to work. He knew that he would not need more than 2 cores, but since he was going to use a Python library called pandas, which holds its data in memory, his bottleneck would be memory. For this reason, he asked for the maximum that could be given, which was 16 GB of RAM. One of the first tasks that Sasha entrusted to Rita, his student and now employee, was to ask the IT department to install the latest version of Anaconda Python, since the last time he had looked, the available version was 3.6. As it was possible that the IT department did not know where to download the program in its most up-to-date version, to avoid any confusion, she sent the following links:


* https://www.anaconda.com/download/ 
* https://docs.continuum.io/anaconda/install/windows 


Moreover, to make sure there were no incidents in case they had questions, she also attached a video that Sasha had once sent her: https://www.youtube.com/watch?v=EbYGBANqDdY. However, as they were computer scientists, they had no problem installing Anaconda Python quickly and easily.

While our Director of Rating was still arranging his desk, Rita approached him and told him that she had been preparing a small list of companies in various sectors at home. The list contained companies from the United States, each with its Yahoo Finance ticker symbol next to the name. Rita asked her superior whether he would like her to start writing Python code to scrape some information from Yahoo Finance.

Sasha said that he would like to see the list first, so Rita gave him the following link:


* https://www.dropbox.com/s/2x1rmt2ma96j2my/yahoo_ticker_sample.xlsx?dl=1 

As soon as he opened the file, Sasha was happy to see that Rita had prepared 5 sectors with a reasonable number of companies, so they would be able to rank them according to the several ratios they would build. He asked Rita to start working on the code right away. However, he asked her not to complicate her life at that stage by downloading income statements and balance sheets, and simply to focus on downloading the information from Yahoo's Key Statistics (KS). As an example, he showed her how to reach the Google KS page, which was just a matter of using the 'GOOG' ticker from the file in the link below:

* https://finance.yahoo.com/quote/GOOG/key-statistics?p=GOOG

Rita, who had no doubts, clicked on the link and, using the right button of her mouse, chose 'View page source'. Although Rita was not an expert in HTML, she began to alternate between the original page and the source code, searching the source for the values she saw on the page in order to understand how Yahoo's HTML was built. It did not take her long to realize that she simply had to look for the concept she wanted preceded by a ">" and followed by the closing tags </td></tr>. For example: ">Market Cap" and "</td></tr>".
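
The pattern Rita found can be tried on a small fragment before touching the live page. The sketch below is an illustration only: the `html` string is a hypothetical, simplified imitation of the row structure described above (the real Yahoo Finance markup is longer and may differ), and `extract_field` is a hypothetical helper name.

```python
# Hypothetical, simplified fragment imitating the row structure described above.
html = (
    '<tr><td><span>Market Cap (intraday)</span></td>'
    '<td class="value">714.29B</td></tr>'
    '<tr><td><span>Trailing P/E</span></td>'
    '<td class="value"><span>38.39</span></td></tr>'
)


def extract_field(source, field):
    """Return the text between the last '>' and '</td></tr>' that follows '>field'."""
    source = source.replace('</span>', '')       # unify the two closing variants
    after_label = source.split('>' + field)[1]   # everything after the field label
    cell = after_label.split('</td></tr>')[0]    # cut at the end of the table row
    return cell.split('>')[-1]                   # keep what follows the last tag


print(extract_field(html, 'Market Cap'))
print(extract_field(html, 'Trailing P/E'))
```

Plain string splitting is fragile if the page layout changes, but it needs no HTML parser and is easy to debug against the page source.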

In the afternoon of the same day, with Anaconda Python installed on her "supercomputer", Rita started working on her Python code. She had previously worked with a Python library called Scrapy. However, this time she preferred to write from scratch the code that would "peel the onion" of the Yahoo Finance Key Statistics HTML, because she realized that Yahoo's HTML had small variations in the tags that could make her life hell if she used Scrapy.

After two intense days of work, Rita had already built the Python code to scrape Yahoo Finance Key Statistics data:


In [1]:
#Importing libraries
from urllib.request import urlopen
import pandas as pd
import time
from random import randint

In [2]:
# READING THE CASE YAHOO TICKER SAMPLE LIST!
df = pd.read_excel("https://www.dropbox.com/s/2x1rmt2ma96j2my/yahoo_ticker_sample.xlsx?dl=1")
print (df.head())

  ticker country      sector RefIndex
0   GOOG     USA  Technology      SPY
1   MSFT     USA  Technology      SPY
2     FB     USA  Technology      SPY
3      T     USA  Technology      SPY
4   ORCL     USA  Technology      SPY


In [3]:
# LIST OF FIELDS WE WILL SCRAPE
list_of_fields = ['Market Cap', 'Enterprise Value', 'Trailing P/E', 'Forward P/E', 'PEG Ratio', 'Price/Sales', 'Price/Book', 'Enterprise Value/Revenue', 'Enterprise Value/EBITDA', 'Fiscal Year Ends', 'Most Recent Quarter', 'Profit Margin', 'Operating Margin', 'Return on Assets', 'Return on Equity', 'Revenue', 'Revenue Per Share', 'Quarterly Revenue Growth', 'Gross Profit', 'EBITDA', 'Net Income Avi to Common', 'Diluted EPS', 'Quarterly Earnings Growth', 'Total Cash', 'Total Cash Per Share', 'Total Debt', 'Total Debt/Equity', 'Current Ratio', 'Book Value Per Share', 'Operating Cash Flow', 'Levered Free Cash Flow', 'Beta', '52-Week Change', 'S&amp;P500 52-Week Change', '52 Week High', '52 Week Low', '50-Day Moving Average', '200-Day Moving Average', 'Avg Vol (3 month)', 'Avg Vol (10 day)', 'Shares Outstanding', 'Float', '% Held by Insiders', '% Held by Institutions', 'Shares Short', 'Short Ratio', 'Short % of Float', 'Shares Short (prior month)', 'Forward Annual Dividend Rate', 'Forward Annual Dividend Yield', 'Trailing Annual Dividend Rate', 'Trailing Annual Dividend Yield', '5 Year Average Dividend Yield', 'Payout Ratio', 'Dividend Date', 'Ex-Dividend Date', 'Last Split Factor', 'Last Split Date']
list_of_dates = ['Fiscal Year Ends', 'Most Recent Quarter', 'Dividend Date', 'Ex-Dividend Date', 'Last Split Date']
print("list_of_fields=",list_of_fields)
print("list_of_dates=",list_of_dates)

list_of_fields= ['Market Cap', 'Enterprise Value', 'Trailing P/E', 'Forward P/E', 'PEG Ratio', 'Price/Sales', 'Price/Book', 'Enterprise Value/Revenue', 'Enterprise Value/EBITDA', 'Fiscal Year Ends', 'Most Recent Quarter', 'Profit Margin', 'Operating Margin', 'Return on Assets', 'Return on Equity', 'Revenue', 'Revenue Per Share', 'Quarterly Revenue Growth', 'Gross Profit', 'EBITDA', 'Net Income Avi to Common', 'Diluted EPS', 'Quarterly Earnings Growth', 'Total Cash', 'Total Cash Per Share', 'Total Debt', 'Total Debt/Equity', 'Current Ratio', 'Book Value Per Share', 'Operating Cash Flow', 'Levered Free Cash Flow', 'Beta', '52-Week Change', 'S&amp;P500 52-Week Change', '52 Week High', '52 Week Low', '50-Day Moving Average', '200-Day Moving Average', 'Avg Vol (3 month)', 'Avg Vol (10 day)', 'Shares Outstanding', 'Float', '% Held by Insiders', '% Held by Institutions', 'Shares Short', 'Short Ratio', 'Short % of Float', 'Shares Short (prior month)', 'Forward Annual Dividend Rate', 'Forward A

In [4]:
# CREATING EMPTY COLUMNS IN THE DATA FRAME FOR RECORDING THE SCRAPED DATA.
for field in list_of_fields:
    df[field] = ''
df['ScrapedName'] = ''
df['Sector'] = ''


In [5]:
stock = df['ticker'][0]
sourceCode = str(urlopen('https://finance.yahoo.com/quote/'+stock+'/profile?p='+stock).read())

In [6]:
ScrapedAux = sourceCode.split('>Sector')[1].split('</strong>')[0].split('>')
print(ScrapedAux)

['</span', '<!-- react-text: 20 --', ':\\xc2\\xa0<!-- /react-text --', '<strong data-reactid="21"', 'Technology']


In [7]:
ScrapedSector = ScrapedAux[len(ScrapedAux)-1]
print(ScrapedSector)

Technology


In [9]:
# MAIN LOOP - SCRAPING DATA - FOR EACH TICKER IN THE df
error = 0
for j in range(len(df)):
    stock = df['ticker'][j]
    url = 'https://finance.yahoo.com/quote/'+stock+'/key-statistics?p='+stock

    # REQUESTING INFORMATION FROM YAHOO AND RETRYING ONCE IF IT DOES NOT RESPOND - STOPPING THE PROGRAM IF YAHOO DOES NOT REPLY FOR TWO tickers IN SEQUENCE!
    try:
        sourceCode = str(urlopen(url).read())
        error = 0
    except Exception:
        time.sleep(randint(1,12))
        try:
            sourceCode = str(urlopen(url).read())
            error = 0
        except Exception:
            if error == 0:
                time.sleep(randint(10,100))
                error = 1
                continue
            else:
                break

    # FINDING COMPANY NAME IN YAHOO PAGE TO MAKE SURE WE ARE GETTING THE CORRECT COMPANY - COMPANY NAME IS EASY TO FIND.
    compname = sourceCode.split('Find out all the key statistics for')[1].split(', including')[0]
    if compname.find('{shortName} ({symbol})') >= 0:
        continue
    # USING df.loc TO AVOID CHAINED ASSIGNMENT (df[col].iloc[j] = ... MAY NOT WRITE TO df).
    df.loc[j, 'ScrapedName'] = compname

    # YAHOO FINANCE HAS 2 WAYS OF CLOSING THE ROW: </td></tr> OR </span></td></tr>, SO WE REMOVE THE </span> TO MAKE THEM ALL THE SAME:
    sourceCode = sourceCode.replace('</span>','')

    for field in list_of_fields:
        try:
            ScrapedValue = sourceCode.split('>' + field)[1].split('</td></tr>')[0].split('>')[-1]
            df.loc[j, field] = ScrapedValue
        except Exception:
            pass  # FIELD NOT FOUND FOR THIS TICKER - LEAVING IT EMPTY

    # ALSO GRABBING COMPANY SECTOR FROM YAHOO PROFILE PAGE.
    ScrapedSector = ''
    try:
        sourceCode = str(urlopen('https://finance.yahoo.com/quote/'+stock+'/profile?p='+stock).read())
        ScrapedAux = sourceCode.split('>Sector')[1].split('</strong>')[0].split('>')
        ScrapedSector = ScrapedAux[-1]
        df.loc[j, 'Sector'] = ScrapedSector
    except Exception:
        pass  # SECTOR NOT FOUND - LEAVING IT EMPTY
    print ("ticker =", df['ticker'][j], "company name =", compname,"sector=",ScrapedSector,j, " de ", len(df))
    

ticker = GOOG company name =  Alphabet Inc. (GOOG) sector= Technology 0  de  86
ticker = MSFT company name =  Microsoft Corporation (MSFT) sector= Technology 1  de  86
ticker = FB company name =  Facebook, Inc. (FB) sector= Technology 2  de  86
ticker = T company name =  AT&amp;T Inc. (T) sector= Communication Services 3  de  86
ticker = ORCL company name =  Oracle Corporation (ORCL) sector= Technology 4  de  86
ticker = VZ company name =  Verizon Communications Inc. (VZ) sector= Communication Services 5  de  86
ticker = TSM company name =  Taiwan Semiconductor Manufactur (TSM) sector= Technology 6  de  86
ticker = INTC company name =  Intel Corporation (INTC) sector= Technology 7  de  86
ticker = CSCO company name =  Cisco Systems, Inc. (CSCO) sector= Technology 8  de  86
ticker = IBM company name =  International Business Machines (IBM) sector= Technology 9  de  86
ticker = NVDA company name =  NVIDIA Corporation (NVDA) sector= Technology 10  de  86
ticker = ACN company name =  Accen
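
The split-based extraction used in the scraping loop can be tried offline on a small piece of markup. The sketch below is only illustrative: `extract_field` and `sample_html` are hypothetical names, and the markup is a made-up imitation of the two row layouts mentioned in the loop's comment, not a copy of the real Yahoo Finance page.

```python
def extract_field(html, field):
    """Return the text of the table cell that follows `field`, or None if absent."""
    # Normalise the two closing variants (</td></tr> and </span></td></tr>).
    html = html.replace('</span>', '')
    try:
        return html.split('>' + field)[1].split('</td></tr>')[0].split('>')[-1]
    except IndexError:
        return None  # label not found in the markup


# Hypothetical markup: one row closes with </span></td></tr>, the other with </td></tr>.
sample_html = (
    '<table>'
    '<tr><td>Market Cap</td><td><span>714.29B</span></td></tr>'
    '<tr><td>Trailing P/E</td><td class="v">38.39</td></tr>'
    '</table>'
)

for field in ['Market Cap', 'Trailing P/E', 'PEG Ratio']:
    print(field, '->', extract_field(sample_html, field))
```

Returning `None` for a missing label makes the "field not on the page" case explicit, which the loop above handles silently with `try`/`except`.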

In [10]:
# SAVING TO EXCEL ALL SCRAPED  DATA!
df.to_excel("yahoo_ticker_sample_scraped.xlsx")
df

Unnamed: 0,ticker,country,sector,RefIndex,Market Cap,Enterprise Value,Trailing P/E,Forward P/E,PEG Ratio,Price/Sales,...,Trailing Annual Dividend Rate,Trailing Annual Dividend Yield,5 Year Average Dividend Yield,Payout Ratio,Dividend Date,Ex-Dividend Date,Last Split Factor,Last Split Date,ScrapedName,Sector
0,GOOG,USA,Technology,SPY,714.29B,622.32B,38.39,21.68,1.61,5.50,...,,,,0.00%,,,1/1,"Apr 27, 2015",Alphabet Inc. (GOOG),Technology
1,MSFT,USA,Technology,SPY,796.08B,766.07B,42.71,20.61,1.71,6.93,...,1.72,1.65%,2.29,69.14%,"Mar 14, 2019","Feb 20, 2019",1/2,"Feb 18, 2003",Microsoft Corporation (MSFT),Technology
2,FB,USA,Technology,SPY,382.9B,372.86B,20.08,17.88,1.01,7.38,...,,,,0.00%,,,,,"Facebook, Inc. (FB)",Technology
3,T,USA,Technology,SPY,217.03B,397.49B,5.79,8.33,1.44,1.32,...,2.00,6.72%,5.31,38.02%,"Feb 1, 2019","Oct 9, 2018",1/2,"Mar 20, 1998",AT&amp;T Inc. (T),Communication Services
4,ORCL,USA,Technology,SPY,175.98B,175.22B,49.00,12.66,1.47,4.41,...,0.76,1.66%,1.38,78.35%,"Oct 30, 2018","Oct 15, 2018",1/2,"Oct 13, 2000",Oracle Corporation (ORCL),Technology
5,VZ,USA,Technology,SPY,230.81B,349.16B,7.14,11.83,2.03,1.77,...,2.37,4.26%,4.52,30.22%,"Feb 1, 2019","Jan 9, 2019",937889/1000000,"Jul 2, 2010",Verizon Communications Inc. (VZ),Communication Services
6,TSM,USA,Technology,SPY,184.88B,174.58B,16.25,15.10,1.07,5.52,...,0.26,0.72%,,60.58%,"Jul 19, 2018","Jun 25, 2018",1/1,"Jul 15, 2009",Taiwan Semiconductor Manufactur (TSM),Technology
7,INTC,USA,Technology,SPY,207.98B,233.12B,14.24,9.97,0.98,3.00,...,1.17,2.45%,2.82,36.41%,"Dec 1, 2018","Nov 6, 2018",1/2,"Jul 31, 2000",Intel Corporation (INTC),Technology
8,CSCO,USA,Technology,SPY,193.96B,189.06B,164.66,13.03,1.55,3.86,...,1.28,2.91%,2.98,412.90%,"Jan 23, 2019","Jan 3, 2019",1/2,"Mar 23, 2000","Cisco Systems, Inc. (CSCO)",Technology
9,IBM,USA,Technology,SPY,105.81B,141.68B,18.76,8.37,8.79,1.32,...,6.14,5.26%,3.30,98.08%,"Dec 10, 2018","Nov 8, 2018",1/2,"May 27, 1999",International Business Machines (IBM),Technology


She also had the Excel file in her hands with the results:

* https://www.dropbox.com/s/66vnpxeiesqcg78/yahoo_ticker_sample_scraped.xlsx?dl=1

Rita's first contact with modeling had been the Machine Learning in Finance course taught by Professor Sasha, who was now her boss. Over several sessions, the professor had told her that the equation used for modeling was:

                        y = f(X)

Where:
* X are the independent or input variables (the variables that Rita had just scraped).
* f is the predictive method used to fit a model and later predict "y" from X (Rita already knew Logistic and Linear Regression, Decision Trees and other methods, but she was not sure which one to apply).
* y is the objective variable, also called the target or output variable: the variable to be predicted.
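
As a minimal sketch of what y = f(X) means in practice, the snippet below takes f to be a linear function and fits it by least squares. Everything in it is an assumption made for illustration: the data are synthetic, the "true" coefficients are invented, and the helper `f` is a hypothetical name.

```python
import numpy as np

# Synthetic inputs: 50 observations of two made-up explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

# Invented relationship, used only to generate y for this illustration.
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# f is assumed to be linear: fit the intercept and slopes by least squares.
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)


def f(X_new):
    """Predict y for new rows of X with the fitted linear model."""
    return coef[0] + X_new @ coef[1:]


print("fitted coefficients:", coef)
print("prediction for X = [1, 1]:", f(np.array([[1.0, 1.0]])))
```

Because the synthetic y contains no noise, the fitted coefficients recover the invented ones; with real data the fit would only approximate the underlying relationship.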


The "y" was a mystery for Rita, since she had never worked on a model before. In addition, she did not know how to obtain it or how to define it, so she did not hesitate to ask her boss how she should proceed.

Sasha told her that, in Rating, the variable to be predicted is the bankruptcy of a company within a certain future period of time. In most cases bankruptcy entails the closure of the company with unpaid debt. However, that is not all: other undesirable situations must be added. For example, a company can leave part of its debt unpaid, because the bank often prefers to write off 30% of the original debt (and recover 70%) rather than take the company to court (which in many cases leads to liquidation) and risk losing an even greater percentage. As for the time window, Sasha commented that it depends on the term of the credit product to which the model is to be applied. That is why there is not one Rating model but many, with different time windows. It is normal to find at least two versions: a short-term version (one year) and a long-term version (5 years). Having made this point, the head of the Rating department went on to say that the first objective variable they could create was:
    
* 1 – (bad) companies with unpaid debt in the next 12 months.
* 0 – (good) anything else.

Rita interrupted: "In this case we would apply a classification method, since the output variable is binary, right?"

Sasha replied affirmatively. However, the problem InfoEmpresarial Spain had at that time was that it held no internal data on defaulted companies, so, to create a model close to reality, they had to use external data, probably from a Credit Bureau such as Experian or Equifax. These data would cost money, but without them they would not be able to build a model. Getting such data would take a long time, so Sasha asked Rita to start building the code for the model with an approximate objective variable, that is, with what is called a Proxy.

This Proxy could be the Rating score of other agencies or the probability of default given by the Bloomberang system, which, judging by what he had observed, Sasha believed was based on the behavior of the stock market.

Once more Rita interrupted to add: "But now we would be talking about a regression method, because the output variable has many possible values, right?"

Sasha replied that there was more than one answer to her question, but that for the test they were going to run, a Linear Regression would be enough.

So as not to complicate their first development further, they would use something they already had. It was the same principle that the Bloomberang company used: a measure of the company's volatility. This would be their "y". The idea relied on the wisdom of crowds ("The Wisdom of Crowds"): in the coming days the market for a company's shares would become more volatile if the company carried more risk, and less volatile if it carried less. "Have you heard about the market Beta?" Sasha asked, and he continued: "If you have not seen it yet, you have it in your scraped Yahoo Finance information."
If Rita really wanted to understand what Beta was, Sasha suggested that she take a look at slide 21 of his time series course. After a few moments, Rita looked for it and found it:


In [11]:
from IPython.display import Image
import os
print(os.getcwd())
Image(filename='BetaCalculation.png', width =1000, height=700)


[Figure: BetaCalculation.png, slide 21 of the time series course (image file not found in this export)]

Sasha continued: "Now, since I know you will be interested in learning how to calculate it in Python, take a look at the code I use in my Finance classes to calculate Beta. But be careful with versions: should the next piece of code not run correctly, open the console and run: pip install pandas-datareader==0.5.0"

In [13]:
# -*- coding: utf-8 -*-
"""
reference: http://gouthamanbalaraman.com/blog/calculating-stock-beta.html
"""
import pandas_datareader.data as web
import datetime
import numpy as np
import pandas as pd


# Grab time series data for 5-year history for the stock (here TEF.MC)
# and for IBEX 35 Index

edate =   datetime.datetime(2004,12, 31, 0, 0, 0, 0)
sdate = edate - datetime.timedelta(days=5*365)

ticker_symbol = 'TEF.MC'
ref_index = '^IBEX'

df_stock = web.DataReader(ticker_symbol,'yahoo',sdate,edate)
df_index = web.DataReader(ref_index,'yahoo',sdate,edate)

# create a time-series of monthly data points
df_stock = df_stock.resample('M').last()
df_index = df_index.resample('M').last()

df_stock['returns'] = df_stock['Adj Close']/ df_stock['Adj Close'].shift(1) -1
df_stock = df_stock.dropna()
df_index['returns'] = df_index['Adj Close']/ df_index['Adj Close'].shift(1) -1
df_index = df_index.dropna()

df = pd.DataFrame({'stock_returns' : df_stock['returns'],
                        'index_returns' : df_index['returns']},
                        index=df_stock.index)
df = df.dropna()

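# reference - http://ci.columbia.edu/ci/premba_test/c0331/s7/s7_5.html
def covariance(a, b):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    # Work on plain arrays so that a[i] is positional, whatever the pandas index is.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    a_mean = np.mean(a)
    b_mean = np.mean(b)
    total = 0.0
    for i in range(len(a)):
        total += (a[i] - a_mean) * (b[i] - b_mean)
    return total / (len(a) - 1)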


print(edate)
numerator = covariance(df['stock_returns'],df['index_returns'])
print("COVARIANCE(stock, benchmark) = COVARIANCE("+ticker_symbol+", "+ref_index +") = " +str(numerator))
denominator = covariance(df['index_returns'],df['index_returns'])
print("VARIANCE(benchmark) = COVARIANCE(benchmark, benchmark) = COVARIANCE("+ref_index+", "+ref_index +") = " +str(denominator))

# BETA = Covariance (stock,index) / Variance (index) = Covariance (stock,index) / Covariance (index,index)
print("BETA = COVARIANCE(stock, benchmark) / VARIANCE(benchmark) = " + str(numerator) + " / " + str(denominator) + " = " +str(numerator/denominator))
#http://www.investopedia.com/ask/answers/070615/what-formula-calculating-beta.asp

2004-12-31 00:00:00
COVARIANCE(stock, benchmark) = COVARIANCE(TEF.MC, ^IBEX) = 0.005082405456775695
VARIANCE(benchmark) = COVARIANCE(benchmark, benchmark) = COVARIANCE(^IBEX, ^IBEX) = 0.003928426971265477
BETA = COVARIANCE(stock, benchmark) / VARIANCE(benchmark) = 0.005082405456775695 / 0.003928426971265477 = 1.2937507796252816
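
The same ratio can be cross-checked without any download. The sketch below is an illustration under stated assumptions: the monthly returns are synthetic (generated with an invented sensitivity of 1.3 to the index), so the numbers have nothing to do with TEF.MC or the IBEX 35. It shows that covariance divided by variance equals the slope of a least-squares line of stock returns on index returns.

```python
import numpy as np

# Synthetic monthly returns (made up for illustration, not market data).
rng = np.random.default_rng(42)
index_returns = rng.normal(0.005, 0.06, size=60)
stock_returns = 1.3 * index_returns + rng.normal(0.0, 0.03, size=60)

# Beta as sample covariance over sample variance (both with n - 1).
cov_matrix = np.cov(stock_returns, index_returns, ddof=1)
beta_cov = cov_matrix[0, 1] / cov_matrix[1, 1]

# The same number is the slope of the least-squares line stock ~ index.
slope, intercept = np.polyfit(index_returns, stock_returns, deg=1)

print("beta from covariance / variance:", beta_cov)
print("beta from regression slope:     ", slope)
```

Both prints agree up to floating-point error, which is why Beta is often described as the regression coefficient of the stock on its reference index.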


---
When Rita saw that she had all the elements for this simulated development, she told her boss that she would "get down to work" on the code. Sasha, who would be at her side during the development, nevertheless warned her that the most important thing for a Data Scientist is not to capture data and apply methods, but to understand the problem being modeled and how that translates into actions to be taken on the data.

Rita did not understand, but she sat enthusiastically next to her boss to work on her first project outside the Master in Big Data. 

Sasha saw the doubt on Rita's face, so he gave her the example of what would happen if Dr. Pepper, a direct competitor of Coca Cola and Pepsi, ceased to exist. One possibility, he added, was that people would stop drinking cola, but he was convinced that most Dr. Pepper consumers would migrate to the cola of the competition, that is, Coca Cola and Pepsi. For our Big Data professor, the situation was comparable to two individuals escaping from a lion in the African savanna: neither needs to outrun the lion; it is enough to be faster than the other person. Sasha continued: "Rita, in terms of data, these anecdotes mean that for every variable we have scraped, we need to create a new transformed variable based on its order, its ranking".

Sasha added that he had seen cases in which "percentile"-type transformations were applied (easily done with a percentile function in Python) or Normalization (applying the formula (Variable - mean) / standard deviation), but he saw theoretical and practical problems in these transformations and commented that his preferred technique was Slot or Bucketing Normalization.
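
The two transformations Sasha mentions can be sketched as follows. The ROA figures and company labels are invented for illustration, and the normalization is assumed to be the common z-score form, which centres on the mean.

```python
import pandas as pd

# Made-up ROA values (in %) for a handful of hypothetical companies.
roa = pd.Series([-4.0, 0.5, 2.93, 3.1, 6.0, 25.0],
                index=['A', 'B', 'C', 'D', 'E', 'F'])

# Percentile-type transformation: share of companies ranked at or below each one.
roa_pct = roa.rank(pct=True)

# Z-score normalization, assuming the usual (value - mean) / standard deviation.
roa_z = (roa - roa.mean()) / roa.std()

print(pd.DataFrame({'roa': roa, 'percentile': roa_pct, 'z_score': roa_z}))
```

The outlier (company F) moves the mean and standard deviation, and therefore every z-score, while the percentile depends only on the ordering of the companies.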

Sasha turned again to a slide of his course to explain to Rita what the transformation of the variable Return On Assets (ROA) would look like with 6 ranges.


![title](ROA_Histogram.png)

The new variable would receive the values of the "Bucket Associated" row, which are 0, 1, 2, 3, 4 and 5 according to the value of the ROA of the company.

Sasha explained that if Telefonica's ROA was 2.93%, its bucket would be 1 (which would translate into a "BBB" if the Rating formula considered only one variable). He added that the sizes of the ranges vary so that the new variable follows an approximately normal distribution: the first range holds 7.14% of the population, the second 14.29%, the third and fourth 28.57% each, the fifth 14.29% and the last 7.14%.
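
The percentages above are fourteenths (1/14, 2/14, 4/14, 4/14, 2/14, 1/14), and accumulating them gives the quantile edges for a six-bucket cut. The sketch below works this out on a toy group of 14 companies whose variable simply takes the values 1 to 14. The data and the names `weights`, `edges` and `score` are assumptions for illustration.

```python
import pandas as pd

# Bucket sizes in fourteenths: 1/14 = 7.14%, 2/14 = 14.29%, 4/14 = 28.57%.
weights = pd.Series([1, 2, 4, 4, 2, 1]) / 14
edges = weights.cumsum().round(4).tolist()
print(edges)

# Toy group of 14 companies whose variable takes the values 1..14.
values = pd.Series(range(1, 15), dtype=float)

# Quantile cut on the cumulative edges; values under the first edge stay NaN.
labels = pd.qcut(values, edges, labels=[1, 2, 3, 4, 5])

# NaN (lowest slice) becomes 0, then the scale is flipped so the lowest value scores 5.
score = 5 - labels.astype(float).fillna(0)
print(score.value_counts().sort_index(ascending=False))
```

With exactly 14 observations the buckets hold 1, 2, 4, 4, 2 and 1 companies. For other group sizes the counts can differ slightly, because the quantile edges are interpolated between observations.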

Sasha concluded by saying that the sample of companies scraped by Rita (yahoo_ticker_sample_scraped.xlsx) contained between 16 and 18 companies per sector. However, some variables would not have data for all the companies. Thus, the original variables should be translated into a score between 0 and 5, giving the worst company a score of 5, the following two a score of 4, the next 4-6 companies a score of 3, the following 4-6 a score of 2, the next two a score of 1 and, finally, the best company a score of 0.


| Variable grade | 5    | 4    |  3   |  2   |  1   |  0   |
|----------------|------|------|------|------|------|------|
| # companies    | 1    | 2    |  4-6 |  4-6 |  2   |  1   |
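
The table can also be read as an explicit rank-based rule. The following is a minimal sketch of that reading, not the implementation used in the case: `bucket_scores` is a hypothetical helper, the ROA values are invented, the middle companies are split as evenly as possible between grades 3 and 2, and higher values are assumed to be better.

```python
import pandas as pd


def bucket_scores(values, higher_is_better=True):
    """Score a group: worst company 5, next two 4, middle split into 3 and 2,
    next-to-best two 1, best company 0. Assumes at least 6 non-missing values."""
    ranked = values.dropna().sort_values(ascending=higher_is_better)  # worst first
    middle = len(ranked) - 6            # companies left for grades 3 and 2
    sizes = [1, 2, middle - middle // 2, middle // 2, 2, 1]
    grades = [5, 4, 3, 2, 1, 0]
    scores = []
    for grade, size in zip(grades, sizes):
        scores.extend([grade] * size)
    # Companies with missing data come back as NaN after the reindex.
    return pd.Series(scores, index=ranked.index).reindex(values.index)


# Made-up ROA values for a hypothetical sector of 16 companies.
roa = pd.Series([float(v) for v in range(16)],
                index=['C%02d' % i for i in range(16)])
scores = bucket_scores(roa)
print(scores.value_counts().sort_index(ascending=False))
```

For 16 companies this yields groups of 1, 2, 5, 5, 2 and 1, which falls inside the 4-6 range the table allows for the two middle grades.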


After reviewing all the code she had worked on in the classes of her Big Data master and consulting some forums, Rita found out how to implement the transformation and develop a first simple model using Ordinary Least Squares (OLS) from Statsmodels in Python.

In [14]:
#IMPORTING LIBRARIES
import pandas as pd
import numpy as np
import statsmodels.api as sm


In [15]:
# READING THE CASE YAHOO TICKER SAMPLE LIST!
df = pd.read_excel("yahoo_ticker_sample_scraped.xlsx")

print(df)

   ticker country      sector RefIndex Market Cap Enterprise Value  \
0    GOOG     USA  Technology      SPY    714.29B          622.32B   
1    MSFT     USA  Technology      SPY    796.08B          766.07B   
2      FB     USA  Technology      SPY     382.9B          372.86B   
3       T     USA  Technology      SPY    217.03B          397.49B   
4    ORCL     USA  Technology      SPY    175.98B          175.22B   
5      VZ     USA  Technology      SPY    230.81B          349.16B   
6     TSM     USA  Technology      SPY    184.88B          174.58B   
7    INTC     USA  Technology      SPY    207.98B          233.12B   
8    CSCO     USA  Technology      SPY    193.96B          189.06B   
9     IBM     USA  Technology      SPY    105.81B          141.68B   
10   NVDA     USA  Technology      SPY     84.49B           83.73B   
11    ACN     USA  Technology      SPY     96.23B            94.6B   
12    TXN     USA  Technology      SPY     86.96B           90.49B   
13    VOD     USA  T

In [16]:
# LIST OF FIELDS SCRAPED
list_of_fields = ['Market Cap', 'Enterprise Value', 'Trailing P/E', 'Forward P/E', 'PEG Ratio', 'Price/Sales', 'Price/Book', 'Enterprise Value/Revenue', 'Enterprise Value/EBITDA', 'Fiscal Year Ends', 'Most Recent Quarter', 'Profit Margin', 'Operating Margin', 'Return on Assets', 'Return on Equity', 'Revenue', 'Revenue Per Share', 'Quarterly Revenue Growth', 'Gross Profit', 'EBITDA', 'Net Income Avi to Common', 'Diluted EPS', 'Quarterly Earnings Growth', 'Total Cash', 'Total Cash Per Share', 'Total Debt', 'Total Debt/Equity', 'Current Ratio', 'Book Value Per Share', 'Operating Cash Flow', 'Levered Free Cash Flow', 'Beta', '52-Week Change', 'S&amp;P500 52-Week Change', '52 Week High', '52 Week Low', '50-Day Moving Average', '200-Day Moving Average', 'Avg Vol (3 month)', 'Avg Vol (10 day)', 'Shares Outstanding', 'Float', '% Held by Insiders', '% Held by Institutions', 'Shares Short', 'Short Ratio', 'Short % of Float', 'Shares Short (prior month)', 'Forward Annual Dividend Rate', 'Forward Annual Dividend Yield', 'Trailing Annual Dividend Rate', 'Trailing Annual Dividend Yield', '5 Year Average Dividend Yield', 'Payout Ratio', 'Dividend Date', 'Ex-Dividend Date', 'Last Split Factor', 'Last Split Date']


In [17]:
#DOING SOME DATA CLEANSING
for fieldname in list_of_fields:
    df[fieldname]= df[fieldname].fillna('0')

print(df)

   ticker country      sector RefIndex Market Cap Enterprise Value  \
0    GOOG     USA  Technology      SPY    714.29B          622.32B   
1    MSFT     USA  Technology      SPY    796.08B          766.07B   
2      FB     USA  Technology      SPY     382.9B          372.86B   
3       T     USA  Technology      SPY    217.03B          397.49B   
4    ORCL     USA  Technology      SPY    175.98B          175.22B   
5      VZ     USA  Technology      SPY    230.81B          349.16B   
6     TSM     USA  Technology      SPY    184.88B          174.58B   
7    INTC     USA  Technology      SPY    207.98B          233.12B   
8    CSCO     USA  Technology      SPY    193.96B          189.06B   
9     IBM     USA  Technology      SPY    105.81B          141.68B   
10   NVDA     USA  Technology      SPY     84.49B           83.73B   
11    ACN     USA  Technology      SPY     96.23B            94.6B   
12    TXN     USA  Technology      SPY     86.96B           90.49B   
13    VOD     USA  T

In [18]:
#LIST OF FIELDS, EXCLUDING THE DATES
list_of_fields_nodates = ['Market Cap', 'Enterprise Value', 'Trailing P/E', 'Forward P/E', 'PEG Ratio', 'Price/Sales', 'Price/Book', 'Enterprise Value/Revenue', 'Enterprise Value/EBITDA', 'Profit Margin', 'Operating Margin', 'Return on Assets', 'Return on Equity', 'Revenue', 'Revenue Per Share', 'Quarterly Revenue Growth', 'Gross Profit', 'EBITDA', 'Net Income Avi to Common', 'Diluted EPS', 'Quarterly Earnings Growth', 'Total Cash', 'Total Cash Per Share', 'Total Debt', 'Total Debt/Equity', 'Current Ratio', 'Book Value Per Share', 'Operating Cash Flow', 'Levered Free Cash Flow', '52-Week Change', 'S&amp;P500 52-Week Change', '52 Week High', '52 Week Low', '50-Day Moving Average', '200-Day Moving Average', 'Avg Vol (3 month)', 'Avg Vol (10 day)', 'Shares Outstanding', 'Float', '% Held by Insiders', '% Held by Institutions', 'Shares Short', 'Short Ratio', 'Short % of Float', 'Shares Short (prior month)', 'Forward Annual Dividend Rate', 'Forward Annual Dividend Yield', 'Trailing Annual Dividend Rate', 'Trailing Annual Dividend Yield', '5 Year Average Dividend Yield', 'Payout Ratio']


In [19]:
#CREATING A FUNCTION TO TRANSFORM TEXT (example 3.17B) INTO NUMBER (REAL NUMBERS, example 3,170,000,000).
# Power of ten associated with each suffix.
d = {
     'k':  3,
     'M':  6,
     'B':  9,
     'T': 12,
     '%': -2
}
def text_to_num(text):
    try:
        if text[-1] in d:
            num, magnitude = text[:-1], text[-1]
            return float(num) * 10 ** d[magnitude]  # "text" ends in T, B, M, k or %
        else:
            return float(text)  # "text" is a string that looks like a number
    except (TypeError, ValueError, IndexError):
        try:
            return 1.0*text  # "text" is already numeric
        except TypeError:
            return 0.0  # impossible to transform into a number
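
A quick way to see what the conversion does is to run it on a few representative inputs. The sketch below restates the function compactly so that it runs on its own. `to_number` and `SUFFIX_POWER` are hypothetical names used only here, and the sample inputs are taken from the kinds of values shown in the scraped table.

```python
# Same suffix table as in the cell above: power of ten per suffix.
SUFFIX_POWER = {'k': 3, 'M': 6, 'B': 9, 'T': 12, '%': -2}


def to_number(text):
    """Compact restatement of text_to_num, for illustration only."""
    try:
        if text[-1] in SUFFIX_POWER:
            return float(text[:-1]) * 10 ** SUFFIX_POWER[text[-1]]
        return float(text)
    except (TypeError, ValueError, IndexError):
        try:
            return 1.0 * text      # already numeric
        except TypeError:
            return 0.0             # not convertible (for example a date)


for raw in ['3.17B', '69.14%', '38.39', 164.66, 'Apr 27, 2015']:
    print(repr(raw), '->', to_number(raw))
```

Note that anything that cannot be parsed, such as a date, silently becomes 0.0, which is indistinguishable from a genuine zero in the later steps.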


In [20]:
#Apply the function to every column with np.vectorize (a convenience wrapper around a Python loop, not a speed-up) - Example of how to call the function: text_to_num('3.17B')
for fieldname in list_of_fields_nodates:
    df[fieldname] = np.vectorize(text_to_num)(df[fieldname])

print(df)

   ticker country      sector RefIndex    Market Cap  Enterprise Value  \
0    GOOG     USA  Technology      SPY  7.142900e+11      6.223200e+11   
1    MSFT     USA  Technology      SPY  7.960800e+11      7.660700e+11   
2      FB     USA  Technology      SPY  3.829000e+11      3.728600e+11   
3       T     USA  Technology      SPY  2.170300e+11      3.974900e+11   
4    ORCL     USA  Technology      SPY  1.759800e+11      1.752200e+11   
5      VZ     USA  Technology      SPY  2.308100e+11      3.491600e+11   
6     TSM     USA  Technology      SPY  1.848800e+11      1.745800e+11   
7    INTC     USA  Technology      SPY  2.079800e+11      2.331200e+11   
8    CSCO     USA  Technology      SPY  1.939600e+11      1.890600e+11   
9     IBM     USA  Technology      SPY  1.058100e+11      1.416800e+11   
10   NVDA     USA  Technology      SPY  8.449000e+10      8.373000e+10   
11    ACN     USA  Technology      SPY  9.623000e+10      9.460000e+10   
12    TXN     USA  Technology      SPY

In [21]:
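#CREATING THE GROUPS - Bucketing Normalization 
IN_MODEL = []
# Cumulative quantile edges: 7.14%, 21.43%, 50%, 78.57%, 92.86%, 100%.
# Values below the first edge fall outside every bin and come back as NaN.
quantile_edges = [0.0714, 0.2143, .5, 0.7857, 0.9286, 1.]
for fieldname in list_of_fields_nodates:
    try:
        df[fieldname+'_group'] = df.groupby(['sector'])[fieldname].transform(
                             lambda x: pd.qcut(x, quantile_edges, labels=range(1,6)))
        # NaN (lowest slice) becomes 0; the scale is then flipped so the lowest values score 5.
        df[fieldname+'_group']= 5 - df[fieldname+'_group'].fillna(0)
        IN_MODEL.append(fieldname+'_group')
    except Exception:
        # The cut can fail, e.g. when a sector has duplicate bin edges; the variable is then left out of the model.
        df[fieldname+'_group'] = 0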
print(df)

   ticker country      sector RefIndex    Market Cap  Enterprise Value  \
0    GOOG     USA  Technology      SPY  7.142900e+11      6.223200e+11   
1    MSFT     USA  Technology      SPY  7.960800e+11      7.660700e+11   
2      FB     USA  Technology      SPY  3.829000e+11      3.728600e+11   
3       T     USA  Technology      SPY  2.170300e+11      3.974900e+11   
4    ORCL     USA  Technology      SPY  1.759800e+11      1.752200e+11   
5      VZ     USA  Technology      SPY  2.308100e+11      3.491600e+11   
6     TSM     USA  Technology      SPY  1.848800e+11      1.745800e+11   
7    INTC     USA  Technology      SPY  2.079800e+11      2.331200e+11   
8    CSCO     USA  Technology      SPY  1.939600e+11      1.890600e+11   
9     IBM     USA  Technology      SPY  1.058100e+11      1.416800e+11   
10   NVDA     USA  Technology      SPY  8.449000e+10      8.373000e+10   
11    ACN     USA  Technology      SPY  9.623000e+10      9.460000e+10   
12    TXN     USA  Technology      SPY

In [22]:
#DEVELOPING A SIMPLE MODEL
output_var = 'Beta'
X = df[list(IN_MODEL)]
y = df[output_var]

try:
    model = sm.OLS(y.astype(float),X.astype(float))
    result = model.fit()
    print (result.summary())
    y_pred = result.predict(X)
except np.linalg.linalg.LinAlgError as err:
    if 'Singular matrix' in err.message:
        print ("MODEL-INVALID (Singular Matrix)")
    else:
        raise

df['pred'] = y_pred



                            OLS Regression Results                            
Dep. Variable:                   Beta   R-squared:                       0.887
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                     9.039
Date:                Thu, 20 Dec 2018   Prob (F-statistic):           8.25e-12
Time:                        11:59:39   Log-Likelihood:                -38.053
No. Observations:                  86   AIC:                             156.1
Df Residuals:                      46   BIC:                             254.3
Df Model:                          40                                         
Covariance Type:            nonrobust                                         
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------


In [23]:
# SAVING TO EXCEL ALL PREDICTED  DATA!
df.to_excel("yahoo_ticker_sample_scraped_grouped_predicted.xlsx")
print (df)

   ticker country      sector RefIndex    Market Cap  Enterprise Value  \
0    GOOG     USA  Technology      SPY  7.142900e+11      6.223200e+11   
1    MSFT     USA  Technology      SPY  7.960800e+11      7.660700e+11   
2      FB     USA  Technology      SPY  3.829000e+11      3.728600e+11   
3       T     USA  Technology      SPY  2.170300e+11      3.974900e+11   
4    ORCL     USA  Technology      SPY  1.759800e+11      1.752200e+11   
5      VZ     USA  Technology      SPY  2.308100e+11      3.491600e+11   
6     TSM     USA  Technology      SPY  1.848800e+11      1.745800e+11   
7    INTC     USA  Technology      SPY  2.079800e+11      2.331200e+11   
8    CSCO     USA  Technology      SPY  1.939600e+11      1.890600e+11   
9     IBM     USA  Technology      SPY  1.058100e+11      1.416800e+11   
10   NVDA     USA  Technology      SPY  8.449000e+10      8.373000e+10   
11    ACN     USA  Technology      SPY  9.623000e+10      9.460000e+10   
12    TXN     USA  Technology      SPY

However, Rita faced multiple questions:

1. Is this model reasonable? How can I evaluate it?
2. Assuming that the model can be improved, how can I improve it? What can I look at, adjust or change in my process?
3. Assuming that my model is good:
    1. How can I transform my prediction into the letters AAA, BBB, BB, etc.?
    2. Where can an SME apply this model once it has been built?
    3. If we decide to sell our Rating for thousands of euros, who pays for the Rating: the analyzed company or the company that wants to consult it? Can there be any conflict of interest? Did we achieve the objective of making Rating reports accessible to SMEs?
    4. If we sell our Rating report cheaply, how can we make the product profitable, knowing that it may cost more to produce than it sells for? Does this change the possible conflict of interest? Did we achieve the objective of making Rating reports accessible to SMEs?
    


When Rita, very interested and intrigued, put her questions to Sasha, he asked her to answer them herself and recommended that she attend his next class, where precisely this case would be discussed. That way they would be able to try to resolve all these unknowns. In addition, he pointed out that, since Rita had followed all the steps he had indicated, she would find the class very interesting.