# Transfer Learning Using fastText API
Modified from [this fastText tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html). fastText was developed by Facebook as a fast alternative to more time-intensive deep learning approaches, and uses a [multinomial logistic regression algorithm](https://towardsdatascience.com/fasttext-bag-of-tricks-for-efficient-text-classification-513ba9e302e7).
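
As a rough illustration of what "multinomial logistic regression" means here: the classifier averages the embeddings of a document's tokens and passes that average through a linear layer followed by a softmax. The sketch below is pure Python with made-up two-dimensional embeddings, weights and label names (all hypothetical, nothing is learned), and it leaves out n-gram features and the other speed-ups fastText uses.

```python
import math

# Hypothetical 2-d word embeddings and per-class weights, for illustration only
EMBEDDINGS = {"crack": [1.0, 0.0], "roof": [0.0, 1.0], "ledge": [0.5, 0.5]}
CLASS_WEIGHTS = {"__label__easy": [1.0, -1.0], "__label__hard": [-1.0, 1.0]}

def average_embedding(tokens):
    """Average the embeddings of the known tokens (unknown tokens are skipped)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - shift) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def predict_proba(text):
    """Score each class with a dot product against the averaged document vector."""
    doc_vector = average_embedding(text.split())
    scores = {label: sum(w * x for w, x in zip(weights, doc_vector))
              for label, weights in CLASS_WEIGHTS.items()}
    return softmax(scores)

probs = predict_proba("crack crack ledge")
```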

Our text features have not been preprocessed yet. We need to combine all three text features into one to retain the most signal, and then ensure that the combined text is UTF-8 encoded. Newlines, extra spaces, etc. will be removed.
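
A minimal sketch of that cleaning step, assuming the three fields arrive as plain Python strings (the helper name `combine_and_clean` and the sample strings are hypothetical, and missing fields are assumed to be `None`):

```python
import re

def combine_and_clean(parts):
    """Join several text fields and collapse all whitespace to single spaces."""
    joined = " ".join(str(p) for p in parts if p is not None)
    # Round-trip through UTF-8, dropping anything that cannot be encoded
    joined = joined.encode("utf-8", errors="ignore").decode("utf-8")
    # \s+ matches newlines, tabs and runs of spaces
    return re.sub(r"\s+", " ", joined).strip()

example = combine_and_clean(["Steep  crack\nclimb", None, "\tBolted anchor "])
```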

This is a non-ordinal classifier, so the reduced grade will be used, with the prefix "\_\_label\_\_" prepended for use by the model. Each row is then written to a .txt file as one line, with the label followed by the document.
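
To make the file format concrete, here is a small sketch of how one row becomes one line. The helper `to_fasttext_line` and the sample rows are hypothetical and only illustrate the layout.

```python
def to_fasttext_line(label, text):
    """Build one training line: '__label__<label> <text>'."""
    # One example per line is expected, so embedded newlines are collapsed to spaces
    clean_text = " ".join(str(text).split())
    return f"__label__{label} {clean_text}"

rows = [(2, "thin crack to a roof"), (0, "easy ledge\nscramble")]
lines = [to_fasttext_line(label, text) for label, text in rows]
```

Each element of `lines` would then be written to the .txt file on its own line.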

In [None]:
#imports
import pandas as pd
import numpy as np
import fasttext
import re

from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def run_fasttext(train, val, test, target_col, x_cols):
    """
    Creates a fastText model trained and evaluated on the given DataFrames with the given target and exogenous variables
    target_col should be a single column name
    x_cols should be a list (even if only a single column is to be used)
    """
    splits = {"train": train, "val": val, "test": test}

    for name, df in splits.items():
        #fastText-specific preprocessing: prefix the label and join the text columns into one string
        df['target'] = '__label__' + df[target_col].astype(str)
        df['exog'] = df[x_cols].apply(lambda row: " ".join(str(v) for v in row), axis=1)

        #save one "<label> <text>" line per row
        #written by hand because to_csv with sep=" " wraps any text containing spaces in quotes
        with open(f"./data/{name}.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(df['target'] + " " + df['exog']) + "\n")

    #train
    model = fasttext.train_supervised(input="./data/train.txt")

    #evaluate - model.test returns (number of examples, precision, recall)
    stats = {name: model.test(f"./data/{name}.txt") for name in splits}
    for name, (_, precision, recall) in stats.items():
        print(f"{name.title()}\tPrecision:\t{round(precision, 3)}\n\tRecall:\t\t{round(recall, 3)}")

    #evaluate - test accuracy (predict returns (labels, probabilities); keep the top label)
    y_pred = test['exog'].apply(lambda x: model.predict(x)[0][0])
    print()
    print(f"Test Accuracy: {round((y_pred == test['target']).mean(), 3)}")

    return model

In [None]:
#load in data
train = pd.read_csv("./data/train.csv")
val = pd.read_csv("./data/val.csv")
test = pd.read_csv("./data/test.csv")

In [None]:
model = run_fasttext(train, val, test, 'grade_reduced', ['lemmatized_text_combined'])

# TALK MORE ABOUT BERT

Here we see that the precision and recall for this model are around 33%, very similar to the BERT implementation. Moving forward we will:
- Remove corpus-specific stop words
- Add more features - climb type
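
A note on reading these numbers: when every document has exactly one label and the model returns exactly one prediction per document, precision and recall share the same numerator and denominator, so both reduce to plain accuracy. The sketch below shows this with hypothetical labels; it is a hand-rolled illustration of that arithmetic, not fastText's own evaluation code.

```python
def precision_recall_at_1(gold, predicted):
    """gold and predicted are equal-length lists with one label per document."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    precision = correct / len(predicted)  # correct / number of predictions made
    recall = correct / len(gold)          # correct / number of gold labels
    return precision, recall

gold = ["__label__0", "__label__1", "__label__2", "__label__1"]
pred = ["__label__0", "__label__2", "__label__2", "__label__0"]
precision, recall = precision_recall_at_1(gold, pred)
```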

### Further Data Cleaning

In [None]:
#https://stackoverflow.com/questions/6116978/how-to-replace-multiple-substrings-of-a-string
def multiple_replace(string, rep_dict):
    """Replaces every occurrence of each key of rep_dict in string with its value, in a single pass (longest keys tried first)"""
    pattern = re.compile("|".join([re.escape(k) for k in sorted(rep_dict,key=len,reverse=True)]), flags=re.DOTALL)
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)
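
One caveat when the keys are ordinary words: the pattern matches substrings, so a key such as `"left "` also fires inside `"cleft "`, and a word at the very end of a string (with no trailing space) is missed. The sketch below reproduces `multiple_replace` so it runs on its own and contrasts it with a hypothetical whole-word variant, `replace_whole_words`, that relies on `\b` word boundaries.

```python
import re

def multiple_replace(string, rep_dict):
    # Same logic as the helper above: one alternation, longest keys first
    pattern = re.compile("|".join(re.escape(k) for k in sorted(rep_dict, key=len, reverse=True)), flags=re.DOTALL)
    return pattern.sub(lambda m: rep_dict[m.group(0)], string)

def replace_whole_words(string, words):
    """Hypothetical variant: remove only whole-word matches, then tidy the spacing."""
    alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    pattern = re.compile(r"\b(?:" + alternation + r")\b")
    return " ".join(pattern.sub(" ", string).split())

text = "cleft in bedrock then left"
substring_result = multiple_replace(text, {"left ": " ", "rock ": " "})
whole_word_result = replace_whole_words(text, ["left", "rock"])
```

Here the substring version damages "cleft" and "bedrock" while leaving the final "left" in place; the whole-word version removes only the standalone "left".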

In [None]:
#use CountVectorizer to find corpus-specific stop words
cvec = CountVectorizer()
cvec_df = pd.DataFrame(cvec.fit_transform(train['lemmatized_text_combined']).todense(), columns=cvec.get_feature_names_out())

In [246]:
#top most common words
cvec_df.sum().sort_values(ascending=False).head(20)

bolt      158145
right     113042
climb      97934
route      92252
anchor     87819
crack      77795
face       66288
leave      60313
start      58287
left       50225
ledge      41138
pitch      39935
crux       35456
corner     34944
wall       34183
good       33553
roof       33553
foot       33422
small      33413
rock       32787
dtype: int64
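
The dense DataFrame above holds one column per vocabulary term, which can use a lot of memory on a large corpus. As a lighter-weight sketch, the same kind of ranking can be approximated with `collections.Counter`. The regular expression mirrors `CountVectorizer`'s default tokenization (lowercased tokens of two or more word characters); the toy corpus and the `top_words` helper are hypothetical.

```python
import re
from collections import Counter

# Tokens of 2+ word characters, as in CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def top_words(documents, n=20):
    """Count tokens across all documents and return the n most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(TOKEN_PATTERN.findall(doc.lower()))
    return counts.most_common(n)

toy_corpus = ["bolt to anchor", "crack to bolt", "a bolt"]
top = top_words(toy_corpus, n=2)
```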

In [None]:
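#treat the 20 most frequent words as corpus-specific stop words
#filter whole tokens so substrings (e.g. "left" inside "cleft") and words at the end of a document are handled correctly
stop_words = set(cvec_df.sum().sort_values(ascending=False).head(20).index)
for df in (train, val, test):
    df['stopped_lemmatized_text_combined'] = df['lemmatized_text_combined'].apply(lambda x: " ".join(w for w in x.split() if w not in stop_words))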

In [None]:
#check what the new top words are
cvec = CountVectorizer()
stopped_cvec_df = pd.DataFrame(cvec.fit_transform(train['stopped_lemmatized_text_combined']).todense(), columns=cvec.get_feature_names_out())
stopped_cvec_df.sum().sort_values(ascending=False).head(20)

In [None]:
train['stopped_lemmatized_text_combined']

In [None]:
run_fasttext(train, val, test, 'grade_reduced', ['stopped_lemmatized_text_combined'])