# Model Training Automation Project

This is your last project for this semester! It has three goals:
- Have you build a Machine Learning pipeline for a real project, the one you could have built for the Hackathon (had it been better organized).
- Link the blocks of knowledge you already have, and consolidate them.
- Enable you to create your own project, almost from A to Z.

This project is divided into three stages, plus an optional bonus stage. Each stage will be graded. At the end of each stage, I will give you the solution for that stage so you can move on to the next one.

## Stage 1: Create database

I give you two data sources:
* The [Movie Database API](https://developers.themoviedb.org/3/getting-started/introduction). Search for all the movies containing `cat`.
* The [IMDB Kaggle dataset](https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv). Keep only films that have the letters `cat` in their description (it does not have to be a whole word, so `Catastrophe` should match).
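The matching rule for the Kaggle file can be sketched as below. This is only an illustration, assuming the match is case-insensitive (since `Catastrophe` has to match) and that the CSV has a `Description` column; the function names and the sample rows are made up for the example.

```python
import csv
import io


def mentions_cat(description):
    """Return True if the letters 'cat' appear anywhere, in any case."""
    return "cat" in description.lower()


def filter_rows(csv_file, column="Description"):
    """Keep only the rows whose description contains 'cat'.

    `column` is an assumption about the CSV header; adapt it to your file.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader if mentions_cat(row.get(column) or "")]


# Made-up rows standing in for the real CSV file.
sample = io.StringIO(
    "Title,Description\n"
    "Movie A,A Catastrophe strikes the city\n"
    "Movie B,Two dogs go on a road trip\n"
    "Movie C,They relocate to the countryside\n"
)
kept = filter_rows(sample)
print([row["Title"] for row in kept])  # ['Movie A', 'Movie C']
```

`Movie C` is kept because `relocate` contains the letters `cat`. If you load the file into a pandas DataFrame instead, `df[df["Description"].str.contains("cat", case=False, na=False)]` applies the same rule.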

Read the data from both sources and transform it so that it can be stored in a database of your choice.
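For the API source, reading the data mostly means walking through the result pages. Here is a minimal sketch using only the standard library, assuming the v3 `search/movie` endpoint with an `api_key` query parameter and a JSON response containing `results` and `total_pages`. The helper names and the `TMDB_API_KEY` variable name are hypothetical.

```python
import json
import urllib.parse
import urllib.request

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def fetch_page(query, page, api_key):
    """Download one page of search results and decode the JSON body."""
    params = urllib.parse.urlencode(
        {"api_key": api_key, "query": query, "page": page}
    )
    with urllib.request.urlopen(f"{TMDB_SEARCH_URL}?{params}") as response:
        return json.load(response)


def search_all(query, api_key, fetch=fetch_page):
    """Collect the results of every page for one query.

    `fetch` can be swapped for a fake function, so the pagination logic
    can be checked without any network access.
    """
    movies = []
    page = 1
    while True:
        payload = fetch(query, page, api_key)
        movies.extend(payload.get("results", []))
        if page >= payload.get("total_pages", 1):
            break
        page += 1
    return movies


# Typical call, with the key read from the environment (hypothetical name):
#   import os
#   movies = search_all("cat", os.environ["TMDB_API_KEY"])
```

Keeping the key out of the code and in your `.env` file fits the constraints listed at the end of this document.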

A SQL database or Neo4j would be the easiest to set up, but I will happily approve Elasticsearch or Hadoop. You can even use MongoDB if you'd like.

It's okay to have the same movie in both sources - keep all the information, do not remove either entry.

Of course, you should process this data and store it in your database entirely in Python.

You can use a DataFrame to help you aggregate the data.
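As an illustration of storing both sources without dropping duplicates, here is a minimal sketch with SQLite from the standard library. The table layout, the column names and the sample rows are assumptions for the example, not a required schema; a `source` column is one simple way to keep the same movie twice.

```python
import sqlite3


def create_table(conn):
    """Create a single table holding rows from both sources."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS movies (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT NOT NULL,
            title TEXT NOT NULL,
            overview TEXT
        )
        """
    )


def insert_movies(conn, source, rows):
    """Insert rows as-is, tagged with the source they came from."""
    conn.executemany(
        "INSERT INTO movies (source, title, overview) VALUES (?, ?, ?)",
        [(source, row["title"], row["overview"]) for row in rows],
    )
    conn.commit()


# In-memory database and made-up rows, just for the demonstration.
conn = sqlite3.connect(":memory:")
create_table(conn)
insert_movies(conn, "tmdb", [{"title": "Example Movie", "overview": "A cat story"}])
insert_movies(conn, "imdb", [{"title": "Example Movie", "overview": "A catastrophe"}])

count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
print(count)  # 2: the same title is stored once per source
```

With a real file-based database, replace `":memory:"` with a path; with a DataFrame, `DataFrame.to_sql` can write to the same connection.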

* **Take a screenshot** of your database results and send it alongside your code.

* **Deadline**: Tuesday, 5th of May, 23h42.


## Stage 2: Make training and prediction available on API

Read the data from your database.

Step 1. Write three functions:

1. `get_model()` returns an untrained scikit-learn model.
2. The second one trains this model, using data from the database.
3. The third one uses the trained model to predict.

We don't care about the model's accuracy for now, so don't spend time on it.

The model has to predict the movie's title based (at least) on its description/overview. (It can be a very bad prediction, just make sure you take a description/overview as input and return a movie title.)
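The three functions could look like the sketch below. It is one possible choice, not the expected solution: it assumes a TF-IDF vectorizer followed by a 1-nearest-neighbour classifier, and the names `train_model` and `predict_title` are made up. Reading from your database is left out; the made-up lists stand in for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def get_model():
    """Return an untrained pipeline: text -> TF-IDF -> nearest neighbour."""
    return make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))


def train_model(model, overviews, titles):
    """Fit the model on overviews (features) and titles (target)."""
    model.fit(overviews, titles)
    return model


def predict_title(model, overview):
    """Return the predicted title for a single overview."""
    return str(model.predict([overview])[0])


# Made-up samples standing in for the rows read from the database.
overviews = [
    "a cat chases a mouse through the house",
    "a catastrophe hits a quiet city",
    "two cats scatter across town",
]
titles = ["Example Movie A", "Example Movie B", "Example Movie C"]

model = train_model(get_model(), overviews, titles)
print(predict_title(model, "a catastrophe hits a quiet city"))
```

With a single neighbour, the model simply returns the title of the closest known overview, which is enough for this stage.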

When you are done with those functions, you can start the next tasks:

Step 2. Create a Flask API with two endpoints:

1. The first one is `GET /model/train`. It calls the second function, which trains the model on data from the database.
2. The second one is `POST /model/predict`. It takes a sample of features as data, uses the third function to predict the target, and returns the result.
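The two endpoints can be sketched as follows. This is a compact illustration under several assumptions: the model is kept in a module-level dictionary, the request body is JSON with an `overview` field, and `load_training_data()` is a hypothetical stand-in for your database query. The status codes are a design choice, not a requirement.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

app = Flask(__name__)
state = {"model": None}  # holds the trained model between requests


def load_training_data():
    """Hypothetical stand-in for the query that reads your database."""
    overviews = ["a cat chases a mouse", "a catastrophe hits a city"]
    titles = ["Example Movie A", "Example Movie B"]
    return overviews, titles


def get_model():
    """Return an untrained model (an arbitrary choice for the sketch)."""
    return make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))


@app.route("/model/train", methods=["GET"])
def train():
    overviews, titles = load_training_data()
    state["model"] = get_model().fit(overviews, titles)
    return jsonify({"status": "trained", "samples": len(titles)})


@app.route("/model/predict", methods=["POST"])
def predict():
    if state["model"] is None:
        return jsonify({"error": "model not trained yet"}), 409
    payload = request.get_json(silent=True)
    overview = payload.get("overview") if isinstance(payload, dict) else None
    if not overview:
        return jsonify({"error": "missing 'overview'"}), 400
    title = state["model"].predict([overview])[0]
    return jsonify({"title": str(title)})
```

You can try it without starting a server by using `app.test_client()`, for example `app.test_client().post("/model/predict", json={"overview": "a cat chases a mouse"})` after calling the train endpoint.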

* **Deadline**: Tuesday, 12th of May, 23h42.


## Stage 3: Make more available on API

Your model and dataset need to stay alive. How?

Add two endpoints to your Flask API:
* `POST /data` to add a new sample (features + target) to your training dataset.

This way, you can call `GET /model/train` again and see your model update.
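A minimal sketch of `POST /data`, assuming a JSON body with an `overview` (feature) and a `title` (target). The in-memory list `training_rows` is a hypothetical stand-in for an insert into your real database, and the field names are assumptions.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
training_rows = []  # hypothetical stand-in for the database table

REQUIRED_FIELDS = ("overview", "title")


@app.route("/data", methods=["POST"])
def add_sample():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    # Reject samples that lack a feature or a target.
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        return jsonify({"error": "missing fields", "fields": missing}), 400
    training_rows.append({field: payload[field] for field in REQUIRED_FIELDS})
    return jsonify({"stored": len(training_rows)}), 201
```

Validating the sample before storing it keeps incomplete rows out of the training set, so a later call to the train endpoint does not fail on missing values.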

* `GET /model/train_test` to evaluate your model from `get_model()`.

This endpoint has to:
1. Read the data.
2. Transform it into `X` and `y`.
3. Use `train_test_split` to get `X_train`, `X_test`, `y_train`, `y_test`.
4. Train the model from `get_model()` on `X_train`, `y_train`.
5. Compare your predictions on `X_test` with `y_test`.

This lets you run `GET /model/train_test` after changing the content of the `get_model()` function and easily track how the MAE evolves.
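The evaluation steps above can be sketched as a plain function that the endpoint would call. Two assumptions here: the model is the same arbitrary TF-IDF plus nearest-neighbour pipeline, and the score is the accuracy, because titles are text labels. If your target is numeric, `sklearn.metrics.mean_absolute_error` gives the MAE instead. The sample data is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def get_model():
    """Return an untrained model (an arbitrary choice for the sketch)."""
    return make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))


def train_test(X, y, test_size=0.25, random_state=42):
    """Split the data, train on one part, score on the other."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = get_model()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "n_train": len(X_train),
        "n_test": len(X_test),
        "accuracy": float(accuracy_score(y_test, predictions)),
    }


# Made-up data standing in for the X and y built from your database.
topics = ["catapult", "cathedral", "catalog", "caterpillar",
          "catfish", "cattle", "catacomb", "catamaran"]
X = [f"a story about a {topic}" for topic in topics]
y = [f"Example Movie {i}" for i in range(len(topics))]

report = train_test(X, y)
print(report)
```

Fixing `random_state` keeps the split identical between runs, so a change in the score comes from your change to `get_model()` and not from a different split. Note that when every title is unique, the held-out titles were never seen during training, so the score on the test set will be very low.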

* **Deadline**: Tuesday, 19th of May, 23h42.


## Stage 4 (Bonus)

Optional stage, only if you feel like it. Select one or more of the upgrades below:

* Make your API RESTful.
* Improve the text processing so you have more features to predict movie titles.
* Improve your model to better predict movie titles.


* **Deadline**: Tuesday, 26th of May, 23h42.


# Constraints

* Send your code in .ipynb or .py files, as you wish.
* **Take a screenshot** of your database results.
* Send your `.env` file.
* Send your work to my inbox: laure.daumal@ext.devinci.fr
* Respect the deadline for each stage.