We run our scripts on the GPU server 18.104.22.168. The data is present on the same server.
Raw data for the cases and the stock price changes can be found at -

# Main Directory
/data/WorkData/firmEmbeddings/
# Case Data is inside the directory CaseData/
# Stock Data is inside the directory StockData/
The data after processing and joining can be found at -

# Main Directory
/data/WorkData/firmEmbeddings/Models/
# Random Forest for Stock Prediction Data is inside StockPredictionUsingRandomForest/
# Neural Network for Stock Prediction Data is inside StockPredictionUsingNeuralNetwork/Data_stock_all/
# Neural Network for Firm Embeddings Data is inside FirmEmbeddings/
To install the packages needed by all the scripts, execute -

chmod 755 requirements.sh
sh requirements.sh
Open a Python shell and download punkt -

python3
>>> import nltk
>>> nltk.download('punkt')
Scripts to process the raw case data -

These files are present in the CaseData folder. Run them in the following order. The data generated by these scripts is combined with the stock data to create the final training data for the models, and can be found in the /data/WorkData/firmEmbeddings/CaseData/ folder on the server.

1. filterCases.ipynb - Filters cases from the sentences folder to keep cases in categories 6 and 7. It uses bb2topic.pkl, bb2genis.pkl, and caseid_date.csv, and generates a new folder Filtered_1 plus the files filtered.pkl and casedata.pkl. Filtered_1 contains all cases belonging to categories 6 and 7.
2. ngramdataGenerate.ipynb - Filters the bigram pickle files to keep cases in categories 6 and 7. It uses casedata.pkl and build_vocab_lemma_pos/phrased/, and creates a new folder PickleFiles containing all cases belonging to categories 6 and 7.
3. bigram.ipynb - Creates the final ngramdata.pkl. The code uses id2gram.pkl, casedata.pkl, df-tf.pkl, and the files in the PickleFiles folder.
4. doc2vec.py - Runs the doc2vec algorithm on the filtered case text in Filtered_1 and generates doc2vec_2.model.
5. modeltodata.ipynb - Uses casedata.pkl and doc2vec_2.model. It maps the model vectors to the case metadata and creates a visualization of the document vectors, producing docvector.pkl, traindocvector.pkl, testdocvector.pkl, and validationdocvector.pkl.
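The category filter in step 1 can be sketched roughly as follows. The dict schema (case id mapped to a category number) and the example case ids are assumptions for illustration; the real pickle layouts are defined by the notebooks.

```python
def filter_cases_by_category(case_meta, categories=frozenset({6, 7})):
    """Return only the cases whose topic category is 6 or 7.

    case_meta: dict mapping case_id -> category number
    (an assumed schema; the actual pickles are defined by the notebooks).
    """
    return {cid: cat for cid, cat in case_meta.items() if cat in categories}


# Hypothetical stand-in for the case metadata loaded from the pickles.
cases = {"case_a": 6, "case_b": 3, "case_c": 7}
filtered = filter_cases_by_category(cases)
print(sorted(filtered))  # only case_a and case_c are in categories 6 and 7
```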
Script to process the raw Stock Data -
Run the script filterCompanies.py, present in the StockData directory, to process the stock data -

python3 filterCompanies.py
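The name of the processed file used later (stockData07to13_logdiff_5_0.1.csv) suggests the stock series is converted to log-differences. A minimal sketch of that transformation, with hypothetical prices, is shown below; the actual column handling lives in filterCompanies.py.

```python
import math

def log_diff(prices):
    """Log-differences of a price series: log(p_t) - log(p_{t-1})."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]


prices = [100.0, 105.0, 102.0]  # hypothetical closing prices
returns = log_diff(prices)      # positive for an up move, negative for a down move
```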
Script to join the two data sets -
These files are present in the JoiningDataPrep folder.

1. StockAndCaseDataJoined - Joins the case and stock data. This script uses stockData07to13_logdiff_5_0.1.csv and the docvector files traindocvector.pkl, testdocvector.pkl, and validationdocvector.pkl, and produces training_data_CaseCompanyStockChange.pkl, testing_data_CaseCompanyStockChange.pkl, and validation_data_CaseCompanyStockChange.pkl.
2. ProcessJoinedDataForNN.ipynb - Processes the data for the final run and creates val_data_final.pkl, train_data_final.pkl, and test_data_final.pkl.
3. Finaldata_stockPred.ipynb - Produces the final stock prediction data for all cases and for categories 6 and 7.
4. Finaldata_firmEmbed.ipynb - Produces the final firm embeddings data for all cases and for categories 6 and 7; uses Company_meta.pkl.
5. RankCompany.ipynb - Creates Company_meta_rank.pkl.

After running all these scripts, the data for all the models will be copied to their respective paths mentioned above.
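Conceptually, the join in step 1 matches each case's document vector with the corresponding company's stock price change. A rough sketch of that inner join follows; the (company, date) key and the field names are assumptions for illustration, not the notebook's actual schema.

```python
def join_case_and_stock(doc_vectors, stock_change):
    """Inner-join doc2vec vectors with stock price changes.

    Both arguments are dicts keyed by (company, date) - an assumed key;
    only keys present in both data sets survive the join.
    """
    return {
        key: {"vector": vec, "change": stock_change[key]}
        for key, vec in doc_vectors.items()
        if key in stock_change
    }


# Hypothetical example rows standing in for the real pickles / CSV.
doc_vectors = {("AAPL", "2010-01-04"): [0.1, 0.2], ("XYZ", "2010-01-04"): [0.3, 0.4]}
stock_change = {("AAPL", "2010-01-04"): 0.015}
joined = join_case_and_stock(doc_vectors, stock_change)
```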
Script to generate models for stock prediction and firm embeddings -
# Change file permissions to run the script
chmod 755 RunAllmodels.sh
# Run the following command to execute the script
sh RunAllmodels.sh

This script runs three scripts. Their locations on GitHub are -

1. RunRandomForest.py in the directory Random_Forest/
2. FirmEmbeddingsModel.py in the directory FirmEmbeddings/
3. NeuralNetworkRun_3layers.py in the directory StockPrediction/

RunRandomForest.py generates the Random Forest model and plots the actual vs. predicted change in stock price.

NeuralNetworkRun_3layers.py saves its test-set predictions to predictions.txt in the same path where the data is present. predictions.txt, together with actual.txt (which lives in the same path), is used by the notebook StockPrediction/ScatterPlotPredictedvsActual.ipynb to plot the actual vs. predicted stock price change. The notebook contains the absolute paths for these files, so it can be run from anywhere on the GPU server.

FirmEmbeddingsModel.py saves the firm embeddings matrix in the same path where the data is present. This matrix is used by FirmEmbeddings/VisualizeFirmsEmbeddings.ipynb to visualize the embeddings. The notebook contains t-SNE plots for category 6, category 7, and the combined cases, as well as visualizations of the embeddings against the firms' industries, rankings, and states. It also contains cosine similarity plots for the two categories - Finance and Manufacturing.
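The cosine similarity behind the notebook's similarity plots can be computed as below; the firm vectors shown are hypothetical stand-ins for rows of the saved embedding matrix.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two firm-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


firm_a = [1.0, 0.0, 1.0]  # hypothetical embedding rows, not real model output
firm_b = [1.0, 0.0, 1.0]  # identical direction  -> similarity 1.0
firm_c = [0.0, 1.0, 0.0]  # orthogonal direction -> similarity 0.0
```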