![MLA Logo](https://drive.corp.amazon.com/view/mrruckma@/MLA_headerv2.png?download=true)

In [0]:
# Let's ensure we're using Eider's default compute with the Credential cell below:

In [0]:
import pandas as pd
eider.s3.download("s3://eider-datasets/mlu/projects/MLANLPIFinalProject/training.csv", "/tmp/training.csv")

In [0]:
df = pd.read_csv('/tmp/training.csv', encoding='utf-8', header=0)

In [0]:
# Let's take a look at this data in more detail and then start working. Remember 'human_tag' is our target variable/column
df.head(5)

# __1-Pre-processing Training Data:__
* Read the data
* Remove nan values
* Remove stopwords and apply stemming

Let's check our data for nan values. We want to remove rows with nan values in one or more columns.

In [0]:
# Let's see how many nan values in our data frame
print(df.isna().sum())

# Let's remove them
df.dropna(inplace=True)
print("------------nan rows removed!-------------")

# Confirm that no nan values remain in our data frame
print(df.isna().sum())
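
The effect of `dropna` is easiest to see on a tiny frame. The sketch below uses a hypothetical toy data frame (the rows are invented; only the column names follow the training data) and keeps only the rows that have no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the training data
toy_df = pd.DataFrame({
    "title": ["Great", None, "Bad"],
    "text": ["Works well", "Stopped working", np.nan],
    "star_rating": [5, 1, 2],
    "human_tag": [0, 1, 1],
})

# Count missing values per column
print(toy_df.isna().sum())

# Drop every row that has at least one missing value and renumber the index
clean_df = toy_df.dropna().reset_index(drop=True)
print(clean_df)
```

Returning a new frame instead of using `inplace=True` keeps the original around for comparison, which can be handy while exploring.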

# __2-Splitting the training dataset into training and validation__
* Features: title, text, star_rating
* Target: human_tag

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df[["title", "text", "star_rating"]], df["human_tag"].values, test_size=0.1, shuffle=True)
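
If the classes are imbalanced, a stratified split with a fixed seed may give a more stable validation set. This is a sketch on hypothetical toy data, not a change to the split above; `stratify` and `random_state` are standard `train_test_split` arguments, while the toy frame and the seed value are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 positive and 80 negative examples
toy_df = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "human_tag": [1] * 20 + [0] * 80,
})

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_df[["text"]],
    toy_df["human_tag"].values,
    test_size=0.1,
    shuffle=True,
    stratify=toy_df["human_tag"].values,  # keep the class ratio in both splits
    random_state=42,                      # assumed seed, makes the split repeatable
)

print(len(X_tr), len(X_va), y_va.mean())
```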

# __3-Stop Word Removal and Stemming__

In [0]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt', download_dir='/tmp/')
nltk.download('stopwords', download_dir='/tmp/')
nltk.data.path.append("/tmp/")  # must match the download_dir used above

snow = SnowballStemmer('english') 
stop = stopwords.words('english')

#excluding some useful words from stop words list
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't",
             'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 
             'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",'shouldn', "shouldn't", 'wasn',
            "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
            
stop_words = [word for word in stop if word not in excluding]

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        filtered_sentence=[]
        sent = sent.lower()
        for w in word_tokenize(sent):
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words

        final_text_list.append(final_string)
    return final_text_list
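
To see why negations are kept out of the stop word list, here is a simplified sketch of the same filtering logic. It assumes a tiny hypothetical stop word set and uses `str.split` instead of `word_tokenize`, so it needs no NLTK downloads; it is not a replacement for `process_text`.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical, tiny stop word set; note that "not" is deliberately absent
toy_stop_words = {"this", "does", "the", "and"}

def toy_process(sentence):
    """Lowercase, drop numbers, short tokens and stop words, then stem."""
    kept = []
    for w in sentence.lower().split():
        if (not w.isnumeric()) and len(w) > 2 and w not in toy_stop_words:
            kept.append(stemmer.stem(w))
    return " ".join(kept)

result = toy_process("This does not work")
print(result)
```

With "not" removed as a stop word, this review would collapse to "work", which points in the opposite direction.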

__Let's process text and title fields__

In [0]:
print("Pre-process training dataset")
X_train_title_processed = process_text(X_train["title"].values) 
X_train_text_processed = process_text(X_train["text"].values) 

print("Pre-process validation dataset")
X_val_title_processed = process_text(X_val["title"].values) 
X_val_text_processed = process_text(X_val["text"].values) 

# __4-Computing TF-IDF Vectors__
* Use title, text and star_rating columns.
* We will compute the TF-IDF vectors for training, validation and test separately.

Let's calculate TF-IDF features for training and validation

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_title_vectorizer = TfidfVectorizer(max_features=3500,  ngram_range=(1, 2))
tfidf_text_vectorizer = TfidfVectorizer(max_features=9000, ngram_range=(1, 2)) # Text field is much longer than title

# We fit the vectorizers using only the training data
tfidf_title_vectorizer.fit(X_train_title_processed)
tfidf_text_vectorizer.fit(X_train_text_processed)

X_train_title_vectors = tfidf_title_vectorizer.transform(X_train_title_processed)
X_train_text_vectors = tfidf_text_vectorizer.transform(X_train_text_processed)

X_val_title_vectors = tfidf_title_vectorizer.transform(X_val_title_processed)
X_val_text_vectors = tfidf_text_vectorizer.transform(X_val_text_processed)
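
The fit-on-training, transform-everything pattern can be checked on a toy corpus. In this sketch the documents and the small `max_features` value are assumptions chosen for illustration; words that were not seen during `fit` are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora
train_docs = ["good product", "bad product"]
val_docs = ["good unknown"]

# Unigrams and bigrams, capped at 5 features
vectorizer = TfidfVectorizer(max_features=5, ngram_range=(1, 2))
vectorizer.fit(train_docs)  # the vocabulary comes from training data only

train_vectors = vectorizer.transform(train_docs)
val_vectors = vectorizer.transform(val_docs)

print(sorted(vectorizer.vocabulary_))
print(train_vectors.shape, val_vectors.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one score the other.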

Let's put everything together: star rating + title vector + text vector

Each row's feature vector then has at most 1 + 3500 + 9000 = 12501 entries (fewer if a vectorizer finds fewer n-grams than its `max_features`).

In [0]:
import numpy as np

# ------Training data -----------
# Normalize star rating
X_train_star_rating_norm = X_train["star_rating"] / X_train["star_rating"].max()

# Let's use the star rating and the tf-idf vectors together 
X_train_merged = np.column_stack((X_train_star_rating_norm, X_train_title_vectors.toarray(), X_train_text_vectors.toarray()))

# ------Validation data -----------
# Normalize star rating with the training maximum so both splits share the same scale
X_val_star_rating_norm = X_val["star_rating"] / X_train["star_rating"].max()

# Let's use the star rating and the tf-idf vectors together 
X_val_merged = np.column_stack((X_val_star_rating_norm, X_val_title_vectors.toarray(), X_val_text_vectors.toarray()))
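
Calling `.toarray()` on large TF-IDF matrices can use a lot of memory. A possible alternative, sketched here on hypothetical toy matrices, is to stack the pieces with `scipy.sparse.hstack` and keep the result sparse; scikit-learn's `LogisticRegression` accepts sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical toy inputs standing in for the real features
star_rating_norm = np.array([1.0, 0.2, 0.6])
title_vectors = csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0], [0.0, 0.0]]))
text_vectors = csr_matrix(np.array([[0.1, 0.0, 0.0], [0.0, 0.0, 0.9], [0.0, 0.3, 0.0]]))

# Turn the star rating into a single sparse column, then stack column-wise
star_column = csr_matrix(star_rating_norm.reshape(-1, 1))
merged_sparse = hstack([star_column, title_vectors, text_vectors]).tocsr()

# Dense equivalent, built the same way as in the cell above, for comparison
merged_dense = np.column_stack(
    (star_rating_norm, title_vectors.toarray(), text_vectors.toarray())
)

print(merged_sparse.shape)
```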


# __5-Training:__
* Train using X_train and y_train
* Use validation data (X_val and y_val) to see how well it works.

In [0]:
#Training the model and Testing Accuracy on Validation data
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, f1_score

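# Try l1 and l2 regularization with C values from 0.05 to about 1.7 in steps of 0.05
parameters = {'penalty':['l1', 'l2'], 'C': np.arange(0.05, 1.75, 0.05)}
# The liblinear solver supports both l1 and l2 penalties (the default lbfgs solver does not support l1)
lr = LogisticRegression(class_weight='balanced', solver='liblinear')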

# You can experiment with different cv numbers.
clf = RandomizedSearchCV(lr, parameters, cv=5, scoring=make_scorer(f1_score), verbose=1, n_jobs=-1)

clf.fit(X_train_merged, y_train)

y_val_pred = clf.predict(X_val_merged)
y_val_pred_probs = clf.predict_proba(X_val_merged)
print(classification_report(y_val, y_val_pred)) 
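
After a randomized search it is often useful to look at which settings won. The sketch below runs the same kind of search on synthetic data from `make_classification`; the data, `n_iter=5`, `cv=3`, the seed, and the `liblinear` solver are all assumptions made to keep the example small and runnable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (assumption, for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

param_space = {"penalty": ["l1", "l2"], "C": np.arange(0.05, 1.75, 0.05)}
# liblinear supports both l1 and l2 penalties
base_model = LogisticRegression(class_weight="balanced", solver="liblinear")

search = RandomizedSearchCV(
    base_model,
    param_space,
    n_iter=5,                        # number of sampled parameter settings
    cv=3,
    scoring=make_scorer(f1_score),
    random_state=0,
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

On the real data, the same attributes are available on `clf` after `clf.fit(...)`.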

# __6-Picking the Probability Threshold__:
We will plot the precision-recall curve and pick the point with the highest F1 score, which we can easily calculate from precision and recall.

In [0]:
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_val, y_val_pred_probs[:, 1])

# Let's plot the Precision-Recall curve, precision on the y axis and recall on the x axis

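# Plot the no-skill baseline: for a precision-recall curve this is the fraction of positive examples
no_skill = (y_val == 1).mean()
plt.plot([0, 1], [no_skill, no_skill], linestyle='--')

# Plot the precision-recall curve for the model
plt.plot(recalls, precisions, marker='.')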

plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')

# show the plot
plt.show()

We will calculate the F1 score using the precision and recall from the curve above. Later, we will pick the threshold that resulted in the largest F1 score as our new decision boundary.

![f1](https://drive-render.corp.amazon.com/view/cesazara@/cv-notebook-images/f1_score.png?download=true)
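
For reference, the standard definition of the F1 score is the harmonic mean of precision $P$ and recall $R$:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R}
$$

It is undefined when $P + R = 0$; a common convention, assumed here, is to treat it as 0 in that case.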

In [0]:
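highest_f1 = 0
threshold_highest_f1 = 0
# precisions and recalls have one more entry than thresholds, so the last point is skipped
for idx, threshold in enumerate(thresholds):
    denominator = precisions[idx] + recalls[idx]
    if denominator == 0:
        continue  # F1 is undefined when precision and recall are both 0
    # Use the name f1 so that the imported f1_score function is not overwritten
    f1 = 2 * precisions[idx] * recalls[idx] / denominator
    if f1 > highest_f1:
        highest_f1 = f1
        threshold_highest_f1 = threshold
print("Highest F1:", highest_f1, ", Threshold for the highest F1:", threshold_highest_f1)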
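
The same search can be written without a loop. This is a sketch on hypothetical toy labels and scores; it drops the final precision/recall point, which has no matching threshold, and treats F1 as 0 where precision and recall are both 0.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical toy labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# The last precision/recall pair has no corresponding threshold
p, r = precisions[:-1], recalls[:-1]
denom = p + r

# Element-wise F1, with 0 wherever precision + recall is 0
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
best_f1 = f1[best]
print("Highest F1:", best_f1, "at threshold:", best_threshold)
```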

# __7-Getting predictions on test data and saving results__:
Prepare the test data using the same steps as before (except splitting):
* Read the test data and impute missing values
* Stopword removal and stemming
* Get TF-IDF features
* Predict the results using the model we trained.
* Save the result to the tmp folder

__Reading and imputing:__

In [0]:
# Download the test data
eider.s3.download("s3://eider-datasets/mlu/projects/MLANLPIFinalProject/public_test_features.csv", "/tmp/public_test_features.csv")

# Read the test data (It doesn't have the human_tag label, we are trying to predict that :D )
test_df = pd.read_csv('/tmp/public_test_features.csv', encoding='utf-8', header=0)

# Fill missing text and title values
test_df['text'] = test_df['text'].fillna("")
test_df['title'] = test_df['title'].fillna("")

__Stopword removal and stemming:__

In [0]:
print("Process test dataset")
X_test_title_processed = process_text(test_df["title"].values) 
X_test_text_processed = process_text(test_df["text"].values) 

__Get TF-IDF features:__

In [0]:
X_test_title_vectors = tfidf_title_vectorizer.transform(X_test_title_processed)
X_test_text_vectors = tfidf_text_vectorizer.transform(X_test_text_processed)

__Predict the results using the same model we trained on:__

In [0]:
# ------Test data -----------

# Normalize star rating with the training maximum so test data is scaled like the training data
X_test_star_rating_norm = test_df["star_rating"] / X_train["star_rating"].max()

# Let's use the star rating and the tf-idf vectors together 
X_test_merged = np.column_stack((X_test_star_rating_norm, X_test_title_vectors.toarray(), X_test_text_vectors.toarray()))

# Make predictions using our trained model
test_prediction_probs = clf.predict_proba(X_test_merged)[:, 1]

# Let's apply the new threshold; precision_recall_curve counts scores >= threshold as positive
test_prediction = np.where(test_prediction_probs >= threshold_highest_f1, 1, 0)

__Save the result to tmp folder__

In [0]:
import pandas as pd

result_df = pd.DataFrame()
result_df["ID"] = test_df["ID"]
result_df["human_tag"] = test_prediction

result_df.to_csv("/tmp/project_day2_result.csv", encoding='utf-8', index=False)
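
Before exporting, it can help to read the file back and check its shape. The sketch below does this on hypothetical toy IDs and predictions, writing to the system temp directory under an assumed file name.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy IDs and predictions
toy_result = pd.DataFrame({
    "ID": [101, 102, 103],
    "human_tag": np.array([0, 1, 1]),
})

# Assumed file name, written to the system temp directory
out_path = os.path.join(tempfile.gettempdir(), "toy_submission.csv")
toy_result.to_csv(out_path, encoding="utf-8", index=False)

# Read the file back and check columns, row count and label values
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist(), len(reloaded))
print(reloaded["human_tag"].value_counts())
```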

# __8-Getting our model output out of Eider and into Leaderboard__
Great. We now have a submission file in Eider. Next we need to export it locally so that we can upload it to Leaderboard:
1. Within the Eider console top bar, select [Files](https://eider.corp.amazon.com/file)
2. You should now see 'Files', 'TMP' and 'Exported notebooks' tabs. 
3. Select 'TMP' then select 'Connect to workspace'. You should now see any files from your last run of your workspace. If there was no 'Connect to workspace' option, your files from the last run should already be present.
4. Go to the 'project_day2_result.csv' file (the name used when saving the results above) and select Save
5. This file will now be permanently saved to your Eider account and available for local download from the 'Files' tab via the download button.

We now have our model's output .csv and are ready to upload it to Leaderboard:
1. Search for your class [Leaderboard instance](https://leaderboard.corp.amazon.com/) and go to the 'Make a Submission' section
2. Upload your local file and include your notebook version URL for tracking
3. Your score on the public leaderboard should now appear. Marvel at how much room for improvement there is
