## Task
Given a sentence, detect the level of formality and informativeness. <br><br>
In data science terms, it is a regression problem rather than a classification problem, since the goal is to predict continuous variables (formality, informativeness) from text data. <br><br>
I used Ridge regression, a regularised form of linear regression, to solve the problem. Linear models are quick and easy to implement and are well suited to sparse data such as text.

In [1]:
# import libraries
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from scipy.sparse import hstack
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn import preprocessing
from math import sqrt

## Data summary
The SQUINKY dataset contains sentences and their corresponding annotations, one sentence per line.

In [2]:
dataset = pandas.read_csv('fii_annotations/mturk_experiment_1.csv', encoding = "ISO-8859-1")
dataset.describe()

| | Sentence ID | Formality | Informativeness | Implicature | Length in Words | Length in Characters | F-score | I-score | Lexical Density |
|---|---|---|---|---|---|---|---|---|---|
| count | 7032.0 | 7032.0 | 7032.0 | 7032.0 | 7032.0 | 7032.0 | 7032.0 | 7032.0 | 7032.0 |
| mean | 3510.124005 | 3.758817 | 4.580063 | 3.941382 | 19.213879 | 111.665956 | 63.422057 | 4.975162 | 64.204264 |
| std | 2031.328248 | 1.311312 | 1.184735 | 0.926688 | 11.737192 | 70.119067 | 18.213751 | 2.069756 | 12.656599 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | -25.0 | 0.0 | 0.0 |
| 25% | 1751.75 | 2.6 | 3.8 | 3.4 | 10.0 | 59.0 | 53.846154 | 3.7 | 57.072829 |
| 50% | 3509.5 | 3.8 | 4.8 | 4.0 | 18.0 | 101.0 | 65.909091 | 4.8125 | 63.333333 |
| 75% | 5269.25 | 4.8 | 5.4 | 4.6 | 26.0 | 152.0 | 75.862069 | 6.071429 | 70.0 |
| max | 7027.0 | 7.0 | 7.0 | 6.8 | 150.0 | 810.0 | 150.0 | 22.0 | 200.0 |


## Data preparation
I created the data splits for training (80%) and testing (20%). <br><br>
I formatted the sentences in the training set so that all characters are lowercase and only letters and digits remain. Then I constructed the TF-IDF feature vectors. TF-IDF vectors are a modified version of bag-of-words vectors: rather than using the raw frequencies of distinct words in a sentence alone, each term frequency is scaled by the term's inverse document frequency (IDF). IDF reduces the weight of words that occur in many sentences, e.g. "the", and gives more weight to words that occur in only a few.
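
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. The three sentences are made up for illustration and are not taken from the SQUINKY data. The `idf` helper follows the smoothed formula that `TfidfVectorizer` applies by default (`smooth_idf=True`), and the result is L2-normalised as the vectorizer does; tokenisation is simplified to whitespace splitting, which is an assumption of this sketch.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, not taken from the SQUINKY data)
corpus = [
    "the meeting is at noon",
    "the report is attached",
    "the cat sat on the mat",
]
docs = [sentence.split() for sentence in corpus]
n_docs = len(docs)

# Document frequency: number of sentences that contain each term
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Raw term counts scaled by IDF, then L2-normalised
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

first = tfidf(docs[0])
# Print the terms of the first sentence from lowest to highest weight
for term, weight in sorted(first.items(), key=lambda kv: kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In this toy corpus "the" appears in every sentence, so it receives the lowest weight in the first sentence, while "meeting", which appears only there, receives one of the highest.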

In [3]:
# create train and test splits
train, test = train_test_split(dataset, test_size=0.2)

In [4]:
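# work on a copy so the assignments below do not write into a slice of dataset
train = train.copy()

# transform train['Sentence'] to lowercase (str.lower() returns a new Series, so assign it back)
train['Sentence'] = train['Sentence'].str.lower()

# replace punctuation and special characters with spaces (replace() is not in-place either)
train['Sentence'] = train['Sentence'].replace('[^a-zA-Z0-9]', ' ', regex = True)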

# convert sentences to a matrix of TF-IDF features
vectorizer = TfidfVectorizer(min_df = 5)
X_tfidf = vectorizer.fit_transform(train['Sentence'])

## Modelling
Ridge regression is a modified version of linear regression: the Ordinary Least Squares cost function is extended with a penalty on the size of the coefficients, called L2 regularisation. The penalty shrinks the coefficients, which makes the model more stable when features are highly correlated and reduces overfitting. I built two regressors to predict Formality and Informativeness independently.
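
The shrinking effect can be illustrated with a minimal sketch for a single feature and no intercept, where the ridge solution has a simple closed form. `ridge_slope` is a hypothetical helper written for this illustration, and the data is a toy example; scikit-learn's `Ridge` additionally fits an unpenalised intercept by default and handles many features at once.

```python
def ridge_slope(x, y, alpha):
    # Minimises sum((y_i - w * x_i) ** 2) + alpha * w ** 2
    # for one feature and no intercept; closed form: w = sum(xy) / (sum(x^2) + alpha)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # toy data, exactly y = 2x

# alpha = 0 recovers the least-squares slope; larger alpha shrinks it towards 0
for alpha in (0.0, 1.0, 14.0):
    print(alpha, ridge_slope(x, y, alpha))
```

With `alpha = 0` the slope is the ordinary least-squares value of 2.0; increasing `alpha` pulls it progressively closer to zero.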

In [5]:
# build models for predicting Formality (rgs_f) and Informativeness (rgs_i)
rgs_f = Ridge(alpha=1.0, random_state=241) # alpha = regularization strength
rgs_i = Ridge(alpha=1.0, random_state=241) # random_state = seed for shuffling the data (only used by the 'sag' and 'saga' solvers)

# target values
y_f = train['Formality']
y_i = train['Informativeness']

# train models on training data
rgs_f.fit(X_tfidf, y_f)
rgs_i.fit(X_tfidf, y_i)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=241, solver='auto', tol=0.001)

## Evaluation
The test set was prepared in the same way as the training set, and both models were evaluated on it. The evaluation metrics are mean absolute error (MAE) and root mean squared error (RMSE). MAE is the average magnitude of the error across all predictions, ignoring its direction. RMSE is the square root of the average squared difference between the predicted and the actual values; because the errors are squared before averaging, it gives more weight to large errors.
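
The difference between the two metrics shows up when the same total error is distributed differently. The following is a minimal pure-Python sketch with made-up ratings; `mae` and `rmse` are illustrative helpers that mirror what `mean_absolute_error` and `sqrt(mean_squared_error(...))` compute.

```python
from math import sqrt

def mae(y_true, y_pred):
    # Average absolute difference between actual and predicted values
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Square root of the average squared difference
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [3.0, 3.0, 3.0, 3.0]
even = [2.0, 4.0, 2.0, 4.0]   # four errors of size 1
spiky = [3.0, 3.0, 3.0, 7.0]  # one error of size 4

print("even :", mae(y_true, even), rmse(y_true, even))
print("spiky:", mae(y_true, spiky), rmse(y_true, spiky))
```

Both prediction sets have the same MAE, but the set with one large error has twice the RMSE.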

In [6]:
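# work on a copy so the assignments below do not write into a slice of dataset
test = test.copy()

# transform test['Sentence'] to lowercase (assign the result back)
test['Sentence'] = test['Sentence'].str.lower()

# replace punctuation and special characters with spaces (assign the result back)
test['Sentence'] = test['Sentence'].replace('[^a-zA-Z0-9]', ' ', regex = True)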

# convert sentences to a matrix of TF-IDF features
X_test = vectorizer.transform(test['Sentence'])

In [8]:
# evaluate models on test data
rslt_f = rgs_f.predict(X_test)
rslt_i = rgs_i.predict(X_test)

# print evaluation metrics
print('Formality - results:')
print('MAE:', mean_absolute_error(test['Formality'], rslt_f)) # mean absolute error regression loss
print('RMSE:', sqrt(mean_squared_error(test['Formality'], rslt_f))) # root mean squared error regression loss

print('Informativeness - results:')
print('MAE:', mean_absolute_error(test['Informativeness'], rslt_i)) 
print('RMSE:', sqrt(mean_squared_error(test['Informativeness'], rslt_i)))

Formality - results:
MAE: 0.7463534878973395
RMSE: 0.9319665473156428
Informativeness - results:
MAE: 0.692765152360049
RMSE: 0.8700044869568727


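According to MAE, the average deviation is 0.75 when predicting Formality and 0.69 when predicting Informativeness. On a scale of 1.0 to 7.0, that is roughly 12% of the range. Both RMSE values (0.93 and 0.87) are also below the standard deviations of the targets reported in the data summary (1.31 and 1.18), so the models give a reasonable estimate of the level of Formality and Informativeness in a sentence.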
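
One way to put such numbers in context is to compare them with a trivial baseline that always predicts the mean of the training targets. The following is a minimal sketch; `mean_baseline_errors` is a hypothetical helper, and the ratings are made up for illustration rather than taken from the dataset.

```python
from math import sqrt

def mean_baseline_errors(train_targets, test_targets):
    """Return (MAE, RMSE) of always predicting the training-set mean."""
    mean = sum(train_targets) / len(train_targets)
    errors = [t - mean for t in test_targets]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Made-up ratings on the 1-7 scale, for illustration only
train_scores = [2.0, 4.0, 6.0]
test_scores = [3.0, 5.0]

baseline_mae, baseline_rmse = mean_baseline_errors(train_scores, test_scores)
print(baseline_mae, baseline_rmse)
```

In the notebook, the same helper could be called with `train['Formality']` and `test['Formality']`; a model is only useful if its errors are clearly below this baseline.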

<b>Note:</b> In a real-life scenario, several models and hyperparameter settings would be compared on a validation set. Due to time constraints, only a single model and a single hyperparameter setting were used here.

#### Besides just predictive accuracy, what useful insights could you use this model to produce which would help a user alter the formality or informativeness of a piece of text that they’ve provided?
I think there are two levels at which the model could provide value for a user.<br><br>

*Indication*: Highlight sentences (or even words) using color codes to indicate how they affect the formality or informativeness scores.<br><br>

*Indication + Suggestions*: Same as the previous, but also suggest alternative options for the user to alter the formality or informativeness of a piece of text.<br><br>
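
A word-level indication could be sketched as follows. Everything here is an assumption for illustration: the `formality_weights` values are invented (in the notebook they could be read from `rgs_f.coef_` paired with `vectorizer.vocabulary_`), the `threshold` is an arbitrary cut-off, and the sketch ignores the TF-IDF scaling of each word.

```python
import re

# Hypothetical term weights, invented for this example
formality_weights = {"regards": 0.8, "hey": -0.9, "gonna": -0.7, "sincerely": 0.6}

def word_contributions(sentence, weights):
    """Pair each word with its weight (0.0 if the model has no weight for it)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [(word, weights.get(word, 0.0)) for word in words]

def label(weight, threshold=0.5):
    # threshold is an arbitrary illustrative cut-off
    if weight >= threshold:
        return "formal"
    if weight <= -threshold:
        return "informal"
    return "neutral"

contributions = word_contributions("Hey, I'm gonna send it. Regards", formality_weights)
for word, weight in contributions:
    print(word, label(weight))
```

The labels could then be mapped to colors in a user interface, and the words labelled "informal" would be natural candidates for replacement suggestions.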