# Modeling
Iterative modeling for predicting grade. Due to the small amount of training data, we will be predicting 

### Ordinal Classification Methods
##### Reduce to Binary Classifiers:
[Create a custom Ordinal Classifier Class](https://towardsdatascience.com/simple-trick-to-train-an-ordinal-regression-with-any-classifier-6911183d2a3c)
- Pr(y=1) = 1 - Pr(Target > 1)
- Pr(y=2) = Pr(Target > 1) - Pr(Target > 2)
- Pr(y=3) = Pr(Target > 2) - Pr(Target > 3)
- Pr(y=4) = Pr(Target > 3) - Pr(Target > 4)
- Pr(y=5) = Pr(Target > 4)

Predict the class with the highest of these probabilities (5 here, many more in our case). Any binary classifier that outputs probabilities can be used.
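The decomposition above can be sketched as follows. This is a minimal illustration, not the linked article's code: the class name `OrdinalClassifier`, the synthetic data, and the choice of `LogisticRegression` as the base model are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


class OrdinalClassifier:
    """Hypothetical helper: ordinal prediction from K-1 binary classifiers."""

    def __init__(self, base_clf):
        self.base_clf = base_clf

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.sort(np.unique(y))
        # One binary model per threshold k, each estimating Pr(Target > k).
        self.clfs_ = [
            clone(self.base_clf).fit(X, (y > k).astype(int))
            for k in self.classes_[:-1]
        ]
        return self

    def predict_proba(self, X):
        # Columns are Pr(Target > k) for every class k except the largest.
        gt = np.column_stack([clf.predict_proba(X)[:, 1] for clf in self.clfs_])
        first = 1.0 - gt[:, :1]            # Pr(y = lowest class)
        middle = gt[:, :-1] - gt[:, 1:]    # Pr(Target > k-1) - Pr(Target > k)
        last = gt[:, -1:]                  # Pr(y = highest class)
        return np.hstack([first, middle, last])

    def predict(self, X):
        # Pick the class whose estimated probability is largest.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# Synthetic 5-level target made by binning a noisy linear score (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)
y = np.digitize(score, [-1.0, -0.3, 0.3, 1.0]) + 1

clf = OrdinalClassifier(LogisticRegression()).fit(X, y)
proba = clf.predict_proba(X)
pred = clf.predict(X)
```

Because the binary models are fitted independently, a middle difference can come out slightly negative. Taking the max still works, but the values are not guaranteed to be proper probabilities.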

##### Mord Implementations
Choosing the best of [these estimators](https://pythonhosted.org/mord/reference.html#mord.LogisticAT) would take more research.
[More about mord](https://fa.bianp.net/blog/2013/logistic-ordinal-regression/)

##### Statsmodels Implementations
Probably the best out-of-the-box option; choose between probit and logit. The model does not take a constant, since the estimated thresholds play that role. [statsmodels page](https://www.statsmodels.org/stable/examples/notebooks/generated/ordinal_regression.html)

##### [Statsmodels Walkthrough](https://analyticsindiamag.com/a-complete-tutorial-on-ordinal-regression-in-python/)
Make the target column actually ordinal:
```python
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=['Fair', 'Good', 'Ideal', 'Very Good', 'Premium'], ordered=True)
data_diam["cut"] = data_diam["cut"].astype(cat_type)
```
The tutorial then fits both the probit and logit `OrderedModel` from statsmodels.

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats

from sklearn.feature_extraction.text import TfidfVectorizer
from statsmodels.miscmodels.ordinal_model import OrderedModel

In [None]:
#load in data
train = pd.read_csv("./data/train.csv")
val = pd.read_csv("./data/val.csv")
test = pd.read_csv("./data/test.csv")

In [None]:
#change type of target column to ordinal - https://analyticsindiamag.com/a-complete-tutorial-on-ordinal-regression-in-python/
cat_type = pd.CategoricalDtype(categories=range(16), ordered=True)

for df in (train, val, test):
    df['grade_reduced'] = df['grade_reduced'].astype(cat_type)

In [None]:
#make word matrices for modeling
tfidf = TfidfVectorizer(min_df=10)
#fit the vocabulary and IDF weights on train only, then reuse them for val and test so the columns match
train_vals = pd.DataFrame(tfidf.fit_transform(train['lemmatized_text_combined']).toarray(), columns=tfidf.get_feature_names_out())
val_vals = pd.DataFrame(tfidf.transform(val['lemmatized_text_combined']).toarray(), columns=tfidf.get_feature_names_out())
test_vals = pd.DataFrame(tfidf.transform(test['lemmatized_text_combined']).toarray(), columns=tfidf.get_feature_names_out())
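
A small sketch of why validation and test text should go through `transform` on the vectorizer fitted to the training text: a fresh `fit_transform` builds a different vocabulary, so the columns no longer line up with the ones the model was trained on. The toy sentences below are made up for the illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical, for illustration only).
train_text = pd.Series(["red apple pie", "green apple tart", "red cherry cake"])
val_text = pd.Series(["small cherry tart", "red pie"])

vec = TfidfVectorizer()
train_X = vec.fit_transform(train_text)   # learns vocabulary and IDF weights
val_X = vec.transform(val_text)           # reuses them: same columns as train_X

# Refitting on the validation text alone yields a different column set.
refit_vec = TfidfVectorizer()
refit_X = refit_vec.fit_transform(val_text)

print(train_X.shape, val_X.shape, refit_X.shape)
```

Words that appear only in the validation text are dropped by `transform`, which is the intended behavior: the model has no coefficient for them.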

In [None]:
model = OrderedModel(train['grade_reduced'], train_vals)

In [None]:
res = model.fit(method='lbfgs')
res.summary()

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         9017     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.07198D+00    |proj g|=  1.52899D-02


 This problem is unconstrained.



At iterate    1    f=  2.02950D+00    |proj g|=  1.21633D-02
