# Stock Market Analysis Project

In this project, you will apply the skills you've learned to perform a basic analysis of stock market data using Python.

You'll start by loading the data from a CSV file, then clean and preprocess it, perform statistical analysis, and finally visualize the results.

## Objectives:
- Load and preprocess stock market data from a CSV file.
- Calculate basic statistical metrics for the stocks.
- Visualize stock price trends and relationships between different stocks.

## Submission of your work/solution:
1. To submit the work you did (in groups), go to the menu (top-left) and choose `File`->`Save Notebook`.
2. Then, go to the menu (top-left) and choose `File`->`Download`.
3. Save the file somewhere on your computer and upload it to the Course Assignment named `Assignment: Stock Market Analysis Project`.

## Bonus Section
- Select a simple price prediction method
- Train the model
- Predict values
- Evaluate vs the actual values

**Instructions:** Follow the steps below, complete the tasks as indicated, and answer the questions at the end.

## Step 1: Setting up The Environment
First, we need to set up the Python environment with the necessary libraries. Execute the following cell to install the required packages if you haven't already.

In [None]:
%pip install pandas matplotlib seaborn

## Step 2: Importing Libraries
Now, import the necessary libraries for this project. 
#### TODO:
- Import Pandas, which we will use for data manipulation.
- Import Matplotlib and Seaborn, which we will use for data visualization.

## Step 3: Loading the Data
The first task is to load the stock data from the CSV file we provided. 

#### TODO:
- Use Pandas to read the data and store it in a DataFrame called `stocks`
  - The file is named `sample_stock_data.csv`

#### TODO:
- Display the first five rows to inspect the data.

## Step 4: Data Cleaning
Before analyzing the data, it's important to check for and handle any missing values or data anomalies. 
#### TODO:
- Check for missing values and print them

#### TODO:
 - Fill any missing data with the previous day's stock price (forward filling)
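
In pandas terms, carrying the previous day's price into a gap is a forward fill (`ffill`), while a backward fill (`bfill`) takes the next available price instead. The sketch below contrasts the two on a made-up series; the dates, values, and the `close` name are illustrative assumptions, not the contents of `sample_stock_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy closing prices with gaps; dates and values are made up for illustration.
prices = pd.Series(
    [10.0, np.nan, 12.0, np.nan, np.nan, 15.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
    name="close",
)

forward_filled = prices.ffill()   # each gap takes the previous valid price
backward_filled = prices.bfill()  # each gap takes the next valid price

# Side-by-side view of the raw series and both fill strategies.
comparison = pd.DataFrame(
    {"raw": prices, "ffill": forward_filled, "bfill": backward_filled}
)
print(comparison)
```

Note that a forward fill cannot repair a gap in the very first row, because there is no earlier value to carry forward.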

#### TODO:
 - **Re-Check** for missing values and print them

## Step 5: Statistical Analysis
Perform a basic analysis to understand the data better. Calculate the mean, median, and standard deviation for each stock's closing price. This will give you a basic idea of the location and variability of each stock.

#### TODO:
- Calculate the mean, median, and standard deviation for each stock's closing price
- Print the mean, median, and standard deviation for each stock's closing price
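
As a rough illustration of what these three statistics tell you, here is a minimal sketch on invented data. The tickers `AAA` and `BBB` and their prices are hypothetical; your DataFrame will have different columns.

```python
import pandas as pd

# Hypothetical closing prices for two made-up tickers.
closes = pd.DataFrame(
    {
        "AAA": [10.0, 12.0, 11.0, 13.0, 14.0],
        "BBB": [50.0, 50.0, 52.0, 48.0, 100.0],  # last value is an outlier
    }
)

# One row per statistic, one column per ticker.
# pandas computes the sample standard deviation (ddof=1) by default.
summary = closes.agg(["mean", "median", "std"])
print(summary)
```

In this toy data the single outlier in `BBB` pulls the mean well above the median, which is one reason to look at both measures of location.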

## Step 6: Data Visualization
Create visualizations to better understand the data:
1. Time series plots for each stock.
2. Histograms showing the distribution of closing prices for each stock.
3. Scatter plots showing the relationship between different stocks.
4. A correlation matrix between the stocks.

#### TODO:
- Draw **line plots** for each stock's price over time

#### TODO:
- Draw **histograms** showing the distribution of closing prices for each stock.

#### TODO:
- Draw a **scatter plot** showing the relationship between:
  - **AAPL** vs **MSFT**

#### TODO:
- Draw a **scatter plot** showing the relationship between:
  - **AAPL** vs **GOOG**

#### TODO:
- Draw a **correlation matrix** using a heatmap between the stocks
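
To get a feel for the numbers a correlation matrix contains, the following sketch computes one for three invented tickers (`AAA`, `BBB`, `CCC` are assumptions for illustration only). It stops at the matrix itself; a table like this is the kind of 2D input that Seaborn's `heatmap` function accepts.

```python
import pandas as pd

# Made-up closing prices chosen so the correlations are easy to predict.
closes = pd.DataFrame(
    {
        "AAA": [10.0, 11.0, 12.0, 13.0, 14.0],
        "BBB": [20.0, 22.0, 24.0, 26.0, 28.0],  # exactly 2 * AAA
        "CCC": [9.0, 8.0, 7.0, 6.0, 5.0],       # falls as AAA rises
    }
)

# Pairwise Pearson correlation (the pandas default) between all columns.
corr = closes.corr()
print(corr.round(2))
```

Values near 1 mean two stocks tend to rise and fall together, values near -1 mean they tend to move in opposite directions, and values near 0 indicate no linear relationship.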

## Summary
In this project, you've performed basic data loading, cleaning, statistical analysis, and visualization of stock market data. 

# Questions to answer (answer in the cell below in English or French)
- **Q1**: Which stocks seem to move together? 
- **Q2**: What trends have you noticed?

# Bonus part (optional)


### Introduction to Linear Regression

Linear regression is a fundamental statistical and machine learning technique where the goal is to predict a dependent variable (outcome) based on one or more independent variables (predictors). In the context of stock prices:

-   **Dependent variable (y)**: This is what you're trying to predict, in this case the current day's closing price of AAPL stock.
-   **Independent variable(s) (X)**: These are the inputs to the model, for this example the previous day's closing price of AAPL stock.

The model learns a line (in simple linear regression) or a hyperplane (in multiple linear regression) that best fits the data points. The best-fit line is the one that minimizes the sum of squared differences between the actual values and the values predicted by the line.
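
For a single predictor, the least-squares line has a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below implements that formula in plain Python on made-up points; the helper name `fit_line` is hypothetical and is not part of scikit-learn.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying close to the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

scikit-learn's `LinearRegression` solves the same minimisation problem, generalised to any number of predictors.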

For further reading and tutorials on linear regression, check out the following:

-   [Linear Regression in Python -- Real Python](https://realpython.com/linear-regression-in-python/)

By following this tutorial and practicing with real datasets, you can gain a deeper understanding of linear regression and how it can be applied to various types of data, including financial markets.


## Setting up Your Environment
First, we need to set up the Python environment with the necessary libraries. Execute the following cell to install the required packages if you haven't already.

In [None]:
%pip install scikit-learn

## Importing Libraries
Now, import the necessary libraries for this part of the project. The imports are provided ready-made for you.

You just need to execute the following cell.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Feature Engineering

#### TODO:
 - Create a new feature / column `AAPL_Previous` that represents the previous day's closing price.

This will be used as input (`X`) to predict the current day's price (`y`).

Hint: Look for the `shift` function
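
To see what `shift` does before applying it to your own data, here is a small sketch on an invented frame. The column names `close` and `close_previous` and the values are assumptions for illustration.

```python
import pandas as pd

# Four made-up daily closing prices.
toy = pd.DataFrame(
    {"close": [100.0, 101.0, 103.0, 102.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# shift(1) moves every value down one row, so each row sees the prior day's close.
toy["close_previous"] = toy["close"].shift(1)
print(toy)
```

The first row of the shifted column is `NaN` because no earlier day exists in the data.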

## Prepare the Data
We need to prepare the dataset for training and testing the model.

#### TODO:
- Drop the first row since it doesn't have a previous day's price due to the shift operation.

## Split the Data into Training and Testing Sets
Use separate parts of the full dataset for training the model and for testing it. If you evaluate the model on the same data it was trained on, overfitting goes undetected and the performance estimate is overly optimistic.

#### TODO: 
- Divide the dataset into training and testing sets
  - Use 70% of the data for training and 30% for testing the model
  - You should get these four variables as a result: `X_train`, `X_test`, `y_train`, `y_test` 

Hint: use the `train_test_split` function.
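
The sketch below shows the shape of a `train_test_split` call on synthetic arrays rather than on the stock data. The `random_state=0` argument is an assumption added here only to make the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: ten samples with a single feature.
X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + 1

# test_size=0.3 holds out 30% of the rows; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))
```

By default the function shuffles the rows before splitting, so the test set is not a contiguous block of the most recent dates.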

## Create and train a Linear Regression Model

#### TODO:
- Create an "empty" Linear Regression model.

#### TODO:
- Train the Linear Regression model using the training data.

Hint: use the `fit` function

## Make Predictions and Evaluate the Model

#### TODO: 
- Using the trained model, make predictions on the test set.

Hint: use the `predict` function

#### TODO: 
 - Evaluate the model's performance using mean squared error and R^2 score.
 
Hint: look at the `mean_squared_error` and `r2_score` functions.

## Print Out Model Performance Metrics

It's important to understand how well the model is performing.
Lower Mean Squared Error and an R^2 score closer to 1 usually indicate a better fit.
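
To make the two metrics concrete, here is a plain-Python sketch of their standard definitions on made-up numbers. The helper names `mse` and `r2` are hypothetical and only mirror what `mean_squared_error` and `r2_score` compute.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(mse(actual, predicted), r2(actual, predicted))
```

Mean squared error is expressed in squared price units, so it depends on the scale of the stock price, whereas R^2 is unitless and compares the model against always predicting the mean.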

#### TODO: 
- Print the Mean Squared Error
- Print the R^2 score

## Plot the Actual vs Predicted Prices

Let's visualize the model's predictions compared to the actual stock prices.
This will give us a visual understanding of how well the model is predicting.

#### TODO: 
- Plot the Actual vs Predicted AAPL Stock Prices
  - The X axis should be the Previous Day Closing Price
  - The Y axis should be the Next Day Closing Price
  - Do a line plot of the predicted Next Day Closing Price
  - Do a scatter plot of the actual Next Day Closing Price


## Plot the Actual vs Predicted Prices over time

Let's visualize the model's predictions compared to the actual stock prices over time.
This second visualization gives another view of how well the model is predicting.

In order to achieve this it is very useful to combine the actual and the predicted prices into a single DataFrame for easy plotting.

#### TODO:
- Create a new DataFrame with two columns and the index equal to the initial index (`y_test.index`):
  - `Actual Price`: the `y_test`
  - `Predicted Price`: the `y_pred`

It's also imperative that the rows in the new DataFrame are ordered by date (this is not done automatically).

#### TODO:
- Sort the index of the new DataFrame

Hint: look at the `sort_index` function
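
As a quick illustration of `sort_index`, the sketch below reorders a small frame whose dates are deliberately out of order. The dates, prices, and the `price` column name are made up.

```python
import pandas as pd

# Rows deliberately out of date order, as they might be after a shuffled split.
dates = pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-02"])
shuffled = pd.DataFrame({"price": [103.0, 100.0, 101.0]}, index=dates)

# sort_index returns a new DataFrame ordered by the index; the original is unchanged.
ordered = shuffled.sort_index()
print(ordered)
```

Because the result is a new object, remember to assign it to a variable before plotting.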

#### TODO: 
- Plot the Actual vs Predicted AAPL Stock Prices over time
  - The X axis should be the Date
  - The Y axis should be the Stock Price
  - Do a line plot of the Predicted Price
  - Do a line plot of the Actual Price