# Introduction

The goal of this project is to build a machine learning model that can predict the stock prices of companies listed on the Saudi Stock Exchange (Tadawul). The dataset used in this project is obtained from Kaggle and contains daily historical stock prices for various companies traded on the Tadawul from 2017 to 2020.

The project will involve the following steps:

Data cleaning and preprocessing: This step will involve cleaning the data, handling missing values, and converting categorical variables into numerical ones using techniques such as one-hot encoding or label encoding.

Feature engineering: This step will involve creating new features from the existing data that may be useful in predicting stock prices. For example, we could create a feature that captures the trend of stock prices over a given period of time.

Model training: This step will involve selecting an appropriate machine learning algorithm, such as linear regression or a neural network, and training it on the preprocessed and engineered dataset.

Model evaluation: This step will involve evaluating the performance of the trained model on a held-out test set using appropriate metrics such as mean squared error or mean absolute error.

Model deployment: Finally, we will deploy the trained model as a web application that allows users to input data about a particular company and get a predicted stock price as output.
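
As one illustration of the feature engineering step above, a trend feature could be a per-company rolling mean of the closing price. The sketch below is only an assumption about how such a feature might look: the toy data, the 3-day window, and the `close_ma3` and `trend` column names are hypothetical, and the notebook itself does not build these features.

```python
import pandas as pd

# Toy data standing in for the Tadawul dataset (values are made up).
toy = pd.DataFrame({
    "trading_name": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2020-01-01", periods=5)) * 2,
    "close": [10.0, 11.0, 12.0, 11.0, 13.0, 50.0, 49.0, 48.0, 47.0, 46.0],
})

toy = toy.sort_values(["trading_name", "date"]).reset_index(drop=True)

# Rolling mean of the previous 3 closes, computed within each company.
# shift(1) keeps the current day's close out of its own feature.
toy["close_ma3"] = (
    toy.groupby("trading_name")["close"]
       .transform(lambda s: s.shift(1).rolling(window=3).mean())
)

# Simple trend signal: previous close relative to its rolling mean.
prev_close = toy.groupby("trading_name")["close"].shift(1)
toy["trend"] = prev_close / toy["close_ma3"] - 1

print(toy)
```

The first three rows of each company have no value for these features, because a full 3-day history is not yet available; such rows would need to be dropped or filled before training.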

By building this model and deploying it as a web application, we can provide users with a convenient tool for predicting the stock prices of companies listed on the Tadawul. This could be useful for investors and traders who are looking for insights into the future performance of different companies in the market.

# Problem Domain

The problem domain of the project is to predict the close stock price for a given trading company based on historical stock data, as well as other relevant factors such as year, month, and day. The goal is to build a machine learning model that can accurately predict future stock prices, which can be used by investors to make informed investment decisions.

# The problem and solution

The project aims to predict the stock prices of companies listed on the Saudi stock market, Tadawul. The main challenge is to develop an accurate machine learning model that predicts the closing price from a given set of features: trading name, open value, year, month, and day. The expected solution is a web application that takes these features as input and outputs the predicted closing price for the corresponding trading name on the given date. To achieve this, a machine learning model is trained on the historical Tadawul data described in the introduction and evaluated using metrics such as Mean Squared Error (MSE). Once the model is satisfactory, it is integrated into a web application using Flask, HTML, and CSS.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load and Clean data

In [None]:
#df=pd.read_csv(r"C:\Users\faris\Data-Science-Capstone-Project\Tadawul_stcks.csv")

df=pd.read_csv(r"C:\Users\فارس الدباسي\Final Project\Tadawul_stcks.csv")

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.isnull().sum() #Checking null

Only a few rows contain nulls compared to the size of the whole dataset, so they can be dropped.

In [None]:
df=df.dropna()
df
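
To back up the statement that only a few rows contain nulls, the share of affected rows can be measured before dropping them. This is a minimal sketch on made-up data; the `toy` frame and its values are hypothetical and only stand in for `df`.

```python
import numpy as np
import pandas as pd

# Made-up data with two incomplete rows.
toy = pd.DataFrame({
    "open": [10.0, np.nan, 12.0, 13.0],
    "close": [10.5, 11.0, np.nan, 13.2],
    "volume": [100, 200, 300, 400],
})

# Count rows that have at least one null, and their share of the dataset.
rows_with_null = int(toy.isnull().any(axis=1).sum())
share = rows_with_null / len(toy)
print(f"{rows_with_null} of {len(toy)} rows ({share:.1%}) contain a null")

# Dropping is reasonable only when this share is small.
cleaned = toy.dropna()
```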

In [None]:
#Drop duplicates
df=df.drop_duplicates()

In [None]:
# Remove the trailing space from column names
df.rename(columns = {'trading_name ':'trading_name','no_trades ':'no_trades'}, inplace = True)

# Exploratory Data Analysis



How many companies are listed on the Saudi stock market?

In [None]:
df['trading_name'].nunique()

What is the highest closing price on the Saudi stock market, and which company does it belong to?

In [None]:
df['close'].max()

In [None]:
df[df['close']==df['close'].max()]

In [None]:
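# Average each numeric column per company
# (numeric_only=True avoids errors on text columns in recent pandas versions)
sorteddf=df.groupby(by='trading_name').mean(numeric_only=True)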

In [None]:
sorteddf

In [None]:
sorteddf['perc_Change']=sorteddf['perc_Change']*100 

# Feature Engineering

Companies are categorized as high risk or low risk based on their mean percentage change.

In [None]:
# Label a company "High" when its mean percentage change is positive, otherwise "Low"
sorteddf['Risk']=["High" if a>0 else "Low" for a in sorteddf['perc_Change']]

In [None]:
sorteddf

In [None]:
sorteddf.Risk.value_counts()


The company with the highest mean percentage change

In [None]:
sorteddf[sorteddf['perc_Change']==sorteddf.perc_Change.max()]

In [None]:
sorteddf.perc_Change.min()

In [None]:
top_5_perc_Change=sorteddf.nlargest(5, 'perc_Change')

In [None]:
top_5_perc_Change

In [None]:
top_5_names=list(top_5_perc_Change.index)
# Despite its name, top_5_close holds the mean 'change' column, not the close price
top_5_close=list(top_5_perc_Change.change)
top_5_names

In [None]:
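# The 'seaborn' style was renamed in matplotlib 3.6; fall back to the old name on older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')
colors = plt.cm.Set2(range(len(top_5_close)))

plt.bar(top_5_names, top_5_close, color=colors)

# Set the title and axis labels
plt.title('Top 5 Companies by Mean Percentage Change')
plt.xlabel('Company')
plt.ylabel('Mean Change')  # the bars show the mean 'change' column
plt.show()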


Stock price over time for the top 5 companies

In [None]:
condition = df["trading_name"].isin(top_5_names)
# Parse the dates and sort so each line is drawn in chronological order
selected_rows = df[condition].assign(date=lambda d: pd.to_datetime(d['date'])).sort_values('date')

for name, company in selected_rows.groupby('trading_name'):
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(company['date'], company['close'])
    ax.set_xlabel('Date')
    ax.set_ylabel('Closing Price')
    ax.set_title(f'Stock Price for {name}')
    # plt.show() already renders the figure, so no extra display(fig) is needed
    plt.show()

# ML model

The machine learning model takes ('trading_name', 'date', 'open') as input and predicts 'close'.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Create a new dataframe with the encoded categorical variables
new_df = df[['trading_name', 'date', 'open', 'close']].copy()

# Encode the 'trading_name' column
trading_name_encoder = LabelEncoder()
new_df['trading_name'] = trading_name_encoder.fit_transform(new_df['trading_name'])

# Convert the 'date' column to datetime format and extract year, month, and day as separate columns
new_df['date'] = pd.to_datetime(new_df['date'])
new_df['year'] = new_df['date'].dt.year
new_df['month'] = new_df['date'].dt.month
new_df['day'] = new_df['date'].dt.day

# Drop the original 'date' column
new_df = new_df.drop('date', axis=1)

# Split the dataset into features (X) and target (y)
X = new_df[['trading_name', 'open', 'year', 'month', 'day']]
y = new_df['close']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest regressor (fixed seed so the results are reproducible)
model = RandomForestRegressor(random_state=42)

model.fit(X_train, y_train)

In [None]:
model2=LinearRegression()
model2.fit(X_train, y_train)
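
Because the rows are daily prices, a random split can place later days in the training set and earlier days in the test set. One alternative, shown here only as a sketch on made-up data and not as the split used in this notebook, is to hold out the most recent dates. The `toy` frame and the 80/20 cutoff are assumptions.

```python
import pandas as pd

# Made-up daily prices for a single company.
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "open": [10.0 + i for i in range(10)],
    "close": [10.5 + i for i in range(10)],
})

# Sort by date, then keep the first 80% for training and the last 20% for testing.
toy = toy.sort_values("date")
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]

print("Train ends:", train["date"].max().date())
print("Test starts:", test["date"].min().date())
```

With this split, every test row is later than every training row, so the evaluation is closer to how the model would be used on future dates.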

# Metrics

The MSE has the advantage of being differentiable, which means that it can be used as a loss function during model training. This allows the model to be optimized using gradient descent or other optimization algorithms.
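
To make the metric concrete, the sketch below computes MSE by hand on a few made-up values and compares it with scikit-learn's result. RMSE, MAE, and R2 are included for reference. The arrays `y_true` and `y_hat` are hypothetical and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted closing prices.
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 12.0, 14.0, 22.0])

errors = y_true - y_hat

# MSE: mean of the squared errors.
mse_manual = np.mean(errors ** 2)
# RMSE: square root of MSE, expressed in the same units as the prices.
rmse = np.sqrt(mse_manual)
# MAE: mean of the absolute errors.
mae_manual = np.mean(np.abs(errors))
# R2: share of the target's variance explained by the predictions.
r2 = r2_score(y_true, y_hat)

print("MSE (manual): ", mse_manual)
print("MSE (sklearn):", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
print("MAE (manual): ", mae_manual)
print("MAE (sklearn):", mean_absolute_error(y_true, y_hat))
print("R2:", r2)
```

Because MSE squares each error, the single error of 2 contributes more to it than the two errors of 1 combined, which is why MSE penalizes large misses more heavily than MAE does.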

In [None]:
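from sklearn.metrics import r2_score, mean_squared_error

# Predict on the testing set
y_pred = model.predict(X_test)    # random forest
y_pred2 = model2.predict(X_test)  # linear regression

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
mse2 = mean_squared_error(y_test, y_pred2)

print('Mean squared error for model 1 (random forest):', mse)
print('Mean squared error for model 2 (linear regression):', mse2)
print('R2 score for model 1 (random forest):', r2_score(y_test, y_pred))
print('R2 score for model 2 (linear regression):', r2_score(y_test, y_pred2))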

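Random forest can work better than linear regression because it can model non-linear relationships between the features and the target variable and copes well with high-dimensional data. Averaging many trees also makes it less prone to overfitting than a single decision tree, although it is not immune to it.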
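
The point about non-linear relationships can be illustrated on synthetic data. The sketch below is an assumption-laden toy example: the sine-shaped target, the sample size, and the `X_toy` and `y_toy` names are hypothetical. It says nothing about which model is better on the Tadawul data, where the close price is likely to be close to a linear function of the open price.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a clearly non-linear relationship.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(500, 1))
y_toy = np.sin(2 * X_toy[:, 0]) + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Fit both model types on the same training data.
linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

mse_linear = mean_squared_error(y_te, linear.predict(X_te))
mse_forest = mean_squared_error(y_te, forest.predict(X_te))

print("Linear regression MSE:", mse_linear)
print("Random forest MSE:   ", mse_forest)
```

A straight line cannot follow the sine curve, so the linear model's error stays high here, while the forest approximates the curve piece by piece.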

# Compare between models

In [None]:
import matplotlib.pyplot as plt

# mse comes from the random forest (model) and mse2 from linear regression (model2)
models = ['Random Forest', 'Linear Regression']
mse_values = [mse, mse2]

plt.bar(models, mse_values)
plt.xlabel('Models')
plt.ylabel('Mean Squared Error')
plt.title('Comparison of Linear Regression and Random Forest models')

plt.show()

# Hyperparameter Tuning 

In [None]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10],
    'min_samples_split': [2, 5, 10],

}

In [None]:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator = model, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search.fit(X_train, y_train)
best_rf_model = grid_search.best_estimator_
best_params = grid_search.best_params_
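
After a grid search has been fitted, its results can be inspected to see how each parameter combination scored. The following is a small sketch on synthetic data with a deliberately tiny grid. The data, the grid values, and the choice of `scoring="neg_mean_squared_error"` are assumptions made for illustration; the notebook's own search uses the default scoring.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (made up for this example).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(120, 2))
y_toy = X_toy[:, 0] * 2 + rng.normal(0, 0.5, size=120)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 5]},
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=3,
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best first.
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
print("Best parameters:", search.best_params_)
print("Best cross-validated MSE:", -search.best_score_)
```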

In [None]:
import pickle

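# Save the model in a pkl file (raw string so the backslashes are not treated as escapes)
# Note: this saves the untuned `model`; pass best_rf_model instead to save the tuned estimator
with open(r'C:\Users\فارس الدباسي\Final Project\model.pkl', 'wb') as file:
    pickle.dump(model, file)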

In [None]:
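# Save the fitted label encoder so the web app can encode trading names the same way
with open(r'C:\Users\فارس الدباسي\Final Project\le.pkl', 'wb') as file:
    pickle.dump(trading_name_encoder, file)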
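
Once the model and encoder are pickled, a web application needs to load them and turn user input into a feature row. The sketch below shows one way this could look. It is not the project's actual Flask code: `predict_close` is a hypothetical helper, the training data is made up, and the pickles are round-tripped in memory instead of being read from the saved files.

```python
import pickle

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

FEATURES = ["trading_name", "open", "year", "month", "day"]

# Toy training data (made-up values) mirroring the notebook's feature columns.
toy = pd.DataFrame({
    "trading_name": ["A", "A", "B", "B"],
    "open": [10.0, 11.0, 50.0, 51.0],
    "year": [2020, 2020, 2020, 2020],
    "month": [1, 1, 1, 1],
    "day": [1, 2, 1, 2],
    "close": [10.5, 11.5, 50.5, 51.5],
})
encoder = LabelEncoder()
toy["trading_name"] = encoder.fit_transform(toy["trading_name"])
toy_model = LinearRegression().fit(toy[FEATURES], toy["close"])

# Round-trip through pickle, as the web app would do with the saved files.
loaded_model = pickle.loads(pickle.dumps(toy_model))
loaded_encoder = pickle.loads(pickle.dumps(encoder))


def predict_close(model, name_encoder, trading_name, open_price, date):
    """Hypothetical helper: build one feature row and return the predicted close."""
    date = pd.to_datetime(date)
    row = pd.DataFrame([{
        "trading_name": name_encoder.transform([trading_name])[0],
        "open": open_price,
        "year": date.year,
        "month": date.month,
        "day": date.day,
    }])
    return float(model.predict(row[FEATURES])[0])


print(predict_close(loaded_model, loaded_encoder, "A", 10.0, "2020-01-01"))
```

A trading name that the encoder did not see during training makes `LabelEncoder.transform` raise a `ValueError`, so a web form would need to restrict or validate the company names it accepts.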

# The process

This section documents the metrics, algorithms, and techniques applied to the dataset. A linear regression algorithm was implemented to predict stock prices. The dataset was preprocessed by keeping only the columns needed for modeling and dropping rows with missing values. The categorical trading name column was then converted to numerical codes using LabelEncoder().

The linear regression model was trained on the training set, and predictions were made on the testing set. The Mean Squared Error (MSE) metric was used to evaluate the model. The MSE obtained was 11, which corresponds to a root mean squared error of about 3.3 in price units; how good this is depends on the typical price level of the stocks being predicted.

Complications during coding included handling missing values and selecting an appropriate algorithm. Several algorithms were tested before settling on linear regression, including random forest and support vector regression. Linear regression was chosen for its simplicity and good performance on this particular dataset.

Overall, the process for implementing linear regression with the given dataset was successful and achieved good results in predicting stock prices.

# Complications that occurred

During the implementation of the linear regression model, one complication was the presence of missing data in the dataset. Since scikit-learn's LinearRegression cannot be fitted on data that contains missing values, the rows with missing values were removed rather than imputed with mean or median values. Another complication was the need to encode categorical variables, such as the trading name, into numerical values using LabelEncoder.
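
Label encoding assigns each company an integer, which a linear model interprets as an ordered quantity. The introduction mentions one-hot encoding as an alternative, and the sketch below shows what that could look like with `pd.get_dummies`. The toy data and the `name` prefix are assumptions; the notebook itself uses LabelEncoder.

```python
import pandas as pd

# Made-up rows with a categorical company column.
toy = pd.DataFrame({
    "trading_name": ["A", "B", "C", "A"],
    "open": [10.0, 50.0, 30.0, 11.0],
})

# Integer codes would imply an order A < B < C.
# One-hot encoding creates one indicator column per company instead.
encoded = pd.get_dummies(toy, columns=["trading_name"], prefix="name", dtype=int)
print(encoded)
```

The trade-off is width: one column per company, which grows with the number of listed companies, whereas label encoding always produces a single column.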

# Conclusions

The dataset was cleaned of null values, and companies were categorized as high risk or low risk so that investors can choose the level of investment risk they want.
An ML model (linear regression) was then built that takes the trading name, date, and open price of a company and predicts the close price, with an R2 score of 0.99.
All of this is available through a website.


One interesting aspect of this project is the use of financial data to predict stock prices, which requires a good understanding of finance and statistical modeling. Additionally, hyperparameter tuning for the random forest model can be challenging, as it requires selecting an appropriate range of hyperparameters to search and evaluating the model's performance for each combination.

## Business Impact:
The model can now estimate the closing price for a requested company, which can make an investor more confident when deciding whether to buy.

## Project Reflection


From this project I learned how to select a dataset, clean it, and build a model for it using pandas and sklearn, then present it on the web with Flask.

## Future Work:


This project could have been improved by:

- Building a more powerful model with deep learning, while taking care to avoid overfitting.
- Dividing the companies into more categories.