## Exercises: Part 1/2 Machine Learning
### Bruno Borges da Silva

**1. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide 𝑛 and 𝑝**

**a. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.**  
> It's a regression problem because the response (CEO salary) is a continuous variable. The goal is inference, because we want to understand the relationship between the predictors (profit, number of employees, industry) and the response (CEO salary). n: 500 (the top firms); p: 3 (profit, number of employees, industry).

**b. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.**
> It's a classification problem because the response is a categorical variable (success or failure). The goal is prediction, because we want to predict whether a new product will be a success or a failure based on historical data. n: 20 (similar products); p: 13 (price, marketing budget, competition price, and ten other variables).


**c. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence, we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.**
> It's a regression problem because the response (% change in the USD/Euro exchange rate) is a continuous variable. The goal is prediction, because we aim to forecast the change in the exchange rate from the changes in the stock markets. n: 52 (weekly data for all of 2012, assuming no week is missing); p: 3 (% change in the US market, the British market, and the German market).


**2. Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer**

| Application                           | Response                               | Predictors                                           | Goal       | Explanation                                                                                                              |
|---------------------------------------|----------------------------------------|------------------------------------------------------|------------|--------------------------------------------------------------------------------------------------------------------------|
| Patient Readmission Prediction        | Whether a patient will be readmitted within 1 month (Yes/No) | Patient's age, diagnosis, previous admissions, drugs prescribed | Prediction | The aim is to predict whether a patient will be readmitted to hospital, which can improve patient care and resource planning. |
| Drug Effectiveness Classification (Pharmaceutical) | Effectiveness of a new drug (Effective/Not Effective) | Patient demographics (age, sex), health conditions, drug dosage, treatment duration, side effects observed  | Inference  | The aim is to infer which factors contribute most to a drug's effectiveness, helping in the development of new drugs.      |
| Grocery Store Product Demand Forecasting | High or low demand for a product                 | Historical sales data, season, price, promotions, competitor prices, local events          | Prediction | The aim is to predict product demand (high or low), to optimize inventory management and pricing strategies.               |


**3. Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer**

| Application                   | Response                             | Predictors                                                     | Goal         | Explanation                                                                     |
|-----------------------------|------------------------------------|--------------------------------------------------------------|------------|-------------------------------------------------------------------------------|
| Energy Consumption Prediction | Energy consumption of a building     | Building size, type, number of occupants, weather conditions, time of year | Prediction   | Predict the energy needs for heating and cooling to optimize energy use.        |
| Traffic Flow Prediction       | Number of vehicles on a road per hour| Time of day, day of the week, holidays, weather conditions, local events | Prediction   | Predict traffic flow to aid in urban planning and management, and inform travelers for better route planning. |
| Hospital Length of Stay Analysis | Length of stay in days             | Patient age, type of surgery, comorbidities, complications, admission type | Inference   | Understand how factors such as surgery type and patient health conditions influence the length of hospital stays. This can inform healthcare policy. |


**4. Describe three reasons why the error term 𝜖 in the general form 𝑌 = 𝑓(𝑥) + 𝜖 might be non-zero?**
> The error term *𝜖* in the regression equation *𝑌 = 𝑓(𝑥) + 𝜖* represents the <u>discrepancy between the observed values of 𝑌 and the values given by 𝑓(𝑥). Some reasons why this error term might be non-zero are</u>:  
**`Measurement Error:`** It happens when there is a difference between the actual value of a variable and the measured value, due to inaccuracies in data collection, recording errors, or limitations in measurement instruments. For example, in a study measuring the effect of study time on exam scores, inaccuracies in reporting the exact study time would contribute to the error term.  
**`Omitted Variable Bias:`** It happens when the model doesn't include one or more relevant variables that influence the response variable. The effect of these omitted variables on the response is absorbed into the error term, leading to non-zero errors. For example, when predicting house prices using only square footage and location, omitting the number of bedrooms or the age of the property can lead to significant prediction errors.  
**`Model Specification Error:`** If the functional form of the model doesn’t accurately represent the true relationship between the predictors and the response variable, this can lead to errors. This includes using a linear model for a relationship that is inherently non-linear, or assuming independence of observations when there is actually correlation. 
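
The three sources above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the true relationship, the noise level, the hidden variable `z`, and the helper names `fit_simple_ols` and `mean_abs_residual` are all hypothetical and chosen for illustration. A straight line is fitted to four synthetic responses, and the average absolute residual is reported for each.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


def mean_abs_residual(x, y):
    """Fit a straight line to (x, y) and return the mean absolute residual."""
    b0, b1 = fit_simple_ols(x, y)
    return sum(abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y)) / len(x)


n = 200
x = [random.uniform(0, 10) for _ in range(n)]
z = [random.uniform(0, 10) for _ in range(n)]  # a variable the model never sees

# Baseline: Y is an exact linear function of x, so nothing is left for the error term.
y_exact = [1.0 + 2.0 * xi for xi in x]
# Measurement error: the recorded Y differs from the true value by random noise.
y_noisy = [yi + random.gauss(0, 1.0) for yi in y_exact]
# Omitted variable: Y also depends on z, but only x is used as a predictor.
y_omitted = [yi + 0.5 * zi for yi, zi in zip(y_exact, z)]
# Misspecified form: the true relationship is quadratic, but a line is fitted.
y_curved = [1.0 + 0.3 * xi ** 2 for xi in x]

for label, y in [
    ("exact linear", y_exact),
    ("measurement error", y_noisy),
    ("omitted variable", y_omitted),
    ("wrong functional form", y_curved),
]:
    print(f"{label:>22}: mean |residual| = {mean_abs_residual(x, y):.3f}")
```

Only the baseline case should give residuals that are essentially zero; each of the other three leaves a clearly non-zero discrepancy.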

**5. Use the following Linear Regression code to train (model.fit()) a linear regression model where the input variable is news sentiment score and the response (output) is stock price:**

![image.png](attachment:image.png)

In [21]:
from sklearn.linear_model import LinearRegression
import numpy as np

# Create an array of sentiment scores from news articles. Each number in the list represents one sentiment score.
# Reshape the array to a 2-D array with one column (required by sklearn)
# Each sentiment score represents, for example, how positive or negative the news is
news_sentiment = np.array([0.2, 0.5, 0.3, -0.1, 0.4, 0.6, 0.1, -0.2, 0.3, 0.0]).reshape(-1,1)

# Create a list of stock prices corresponding to the sentiment scores above
# Each price corresponds to the stock price at the time of the news sentiment score
stock_price = [50, 55, 48, 45, 52, 58, 53, 47, 51, 50]

# Initialize the LinearRegression model
model = LinearRegression()

# Fit the model using the news sentiment as predictor variable (X) and the stock prices as response variable (y)
model.fit(news_sentiment, stock_price) # model.fit(X,y)

# After fitting the model, print out the intercept and the coefficient(s)
print(f"Intercept: {model.intercept_}, Coefficients: {model.coef_}")
print("\n","-"*70)

Intercept: 48.37931034482759, Coefficients: [12.00328407]

 ----------------------------------------------------------------------


**`intercept (𝛽0)`** is the predicted value of the response (y) when the predictor (X) is zero:  y = 𝛽1·X + 𝛽0  -->  y = 𝛽1·0 + 𝛽0  -->  **y = 𝛽0**.  In other words, **`when the news sentiment score (X) is 0, the predicted stock price (y) = intercept (𝛽0)`**, which is about 48.38 for this model.
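
As a cross-check, the intercept and slope can be recomputed by hand with the closed-form least-squares formulas for a single predictor (slope = S_xy / S_xx, intercept = ȳ − slope · x̄). This is a minimal sketch using the same data as the cell above; the variable names `beta_0`, `beta_1` and the helper `predict` are introduced here only for illustration.

```python
news_sentiment = [0.2, 0.5, 0.3, -0.1, 0.4, 0.6, 0.1, -0.2, 0.3, 0.0]
stock_price = [50, 55, 48, 45, 52, 58, 53, 47, 51, 50]

n = len(news_sentiment)
x_mean = sum(news_sentiment) / n
y_mean = sum(stock_price) / n

# Sums of squares and cross-products around the means
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(news_sentiment, stock_price))
s_xx = sum((x - x_mean) ** 2 for x in news_sentiment)

beta_1 = s_xy / s_xx                 # slope (coefficient)
beta_0 = y_mean - beta_1 * x_mean    # intercept


def predict(x):
    """Predicted stock price for a given sentiment score."""
    return beta_0 + beta_1 * x


print(f"beta_0 = {beta_0:.4f}, beta_1 = {beta_1:.4f}")     # beta_0 = 48.3793, beta_1 = 12.0033
print(f"prediction at X = 0: {predict(0.0):.4f}")          # 48.3793, i.e. the intercept
```

The values agree with `model.intercept_` and `model.coef_` printed by sklearn, and the prediction at X = 0 is exactly the intercept.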

**`coefficients (𝛽1, … , 𝛽𝑝)`**: each coefficient tells how much the response (dependent variable) changes when the corresponding predictor (independent variable) increases by one unit, holding the other predictors fixed.

For example, consider a linear regression model that predicts the `price of a house (the response)` based on its `size in square feet (predictor 1)` and `age in years (predictor 2)`. There are two predictors (`p = 2`), leading to two coefficients: `β1` (for the size in square feet) and `β2` (for the age in years). If β1 = 150, it means that for every additional square foot of size, the predicted price of the house increases by $150 (assuming the age of the house stays the same).
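
The interpretation of a coefficient can be checked numerically. In the sketch below only β1 = 150 comes from the example above; the intercept `BETA_0`, the age coefficient `BETA_2`, and the house sizes and ages are hypothetical values made up for illustration.

```python
# Hypothetical coefficients: only BETA_1 = 150 is taken from the text.
BETA_0 = 50_000.0   # intercept (assumed)
BETA_1 = 150.0      # price change per extra square foot
BETA_2 = -1_000.0   # price change per extra year of age (assumed)


def predict_price(size_sqft, age_years):
    """Predicted house price from a two-predictor linear model."""
    return BETA_0 + BETA_1 * size_sqft + BETA_2 * age_years


base = predict_price(2000, 10)
one_more_sqft = predict_price(2001, 10)    # size + 1, age unchanged
one_year_older = predict_price(2000, 11)   # age + 1, size unchanged

print(one_more_sqft - base)    # 150.0   -> equals BETA_1
print(one_year_older - base)   # -1000.0 -> equals BETA_2
```

Changing one predictor by one unit while keeping the other fixed shifts the prediction by exactly that predictor's coefficient.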

**a. Given the parameters produced by the model in the code, what is the stock price for a news sentiment score 0.55? What do the parameters tell us about the relationship between news sentiment score and stock price?**

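The stock price for a news sentiment score of 0.55 is predicted to be approximately $54.98 (48.379 + 12.003 × 0.55 ≈ 54.98). The positive coefficient indicates a positive relationship between news sentiment and stock price: a higher sentiment score is associated with a higher predicted stock price. Specifically, the model suggests that for every one-unit increase in sentiment score, the predicted stock price increases by about 12.00 units, or about 1.20 units for every 0.1 increase in sentiment score. The intercept (about 48.38) is the predicted stock price when the sentiment score is 0.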
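
The prediction can be reproduced directly from the fitted model. The sketch below refits the same data and evaluates the model at 0.55 in two equivalent ways: with `model.predict` and by plugging the intercept and coefficient into the linear equation. The names `new_score`, `predicted_price`, and `manual_price` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

news_sentiment = np.array([0.2, 0.5, 0.3, -0.1, 0.4, 0.6, 0.1, -0.2, 0.3, 0.0]).reshape(-1, 1)
stock_price = [50, 55, 48, 45, 52, 58, 53, 47, 51, 50]

model = LinearRegression()
model.fit(news_sentiment, stock_price)

# predict() expects a 2-D array with one row per observation and one column per predictor
new_score = np.array([[0.55]])
predicted_price = model.predict(new_score)[0]

# Same calculation written out as y = intercept + coefficient * X
manual_price = model.intercept_ + model.coef_[0] * 0.55

print(f"Predicted stock price at sentiment 0.55: {predicted_price:.2f}")  # 54.98
print(f"Manual calculation:                      {manual_price:.2f}")     # 54.98
```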

