# Code Overview

## 1. **Library Imports**

In [None]:
### Imports of libraries to be used

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Models to be used

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

# Other models for text classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression

# Reporting
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

## 2. **Models Defined & Other Models to Explore**

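All the models that could be used for text classification are defined below. Where an estimator accepts it, `random_state = 42` is set so that its internal randomness (for example, the data shuffling in `SGDClassifier` or the bootstrap sampling in `RandomForestClassifier`) is reproducible across runs. `MultinomialNB` and `KNeighborsClassifier` take no such argument.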
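As a small illustration of what a fixed `random_state` provides, the sketch below fits the same `SGDClassifier` twice on synthetic data and checks that the learned coefficients are identical. The data and the helper `fit_coef` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical synthetic data, used only to illustrate reproducibility
rng = np.random.RandomState(0)
X = rng.rand(40, 5)
y = (X[:, 0] > 0.5).astype(int)

def fit_coef(seed):
    """Fit an SGDClassifier with the given seed and return its coefficients."""
    model = SGDClassifier(loss="hinge", max_iter=5, tol=None, random_state=seed)
    return model.fit(X, y).coef_

# Same seed, same data: the shuffling is repeated identically, so the weights match
same_seed_matches = np.allclose(fit_coef(42), fit_coef(42))
print(same_seed_matches)
```
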

In [1]:
### Models Defined

#1. Naive Bayes Model
mr_naivebayes = MultinomialNB()

#2. Naive Bayes Multiclass
mr_naivebayes_multiclass = MultinomialNB()

#3. SGD Classifier
sgd = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)

#4. SGD Classifier Multiclass
sgd_multiclass = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)

### Other Models to Explore

#5. K-Nearest Neighbours
knn = KNeighborsClassifier()

#6. Support Vector Machine
svm = SVC(random_state=42)

#7. Random Forest
rf = RandomForestClassifier(random_state=42)

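#8. Decision Tree (note: DecisionTreeRegressor is a regressor; DecisionTreeClassifier is its classification counterpart)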
dt = DecisionTreeRegressor(random_state=42)

#9. Logistic Regression
lr = LogisticRegression(random_state=42)

## 3. **Pipeline, Parameter Grid & GridSearch Cross-Validation**

A pipeline combining a text feature extractor with a classifier to be chosen later on is defined.
- **CountVectorizer**: <br/>

    CountVectorizer converts each text into a vector of word counts. Count vectors help characterise a text by how often each word occurs in it. <br/> <br/>
    Its major disadvantages are: <br/>
     - It cannot distinguish more important words from less important ones.
     - It treats the words that are most abundant in a corpus as the most statistically significant.
     - It does not capture relationships between words, such as linguistic similarity.
<br/>
<br/>    
-  **TFIDF**: <br/>
    
    TFIDF (term frequency-inverse document frequency) provides a numerical measure of how important a word is to a document within a corpus. A word's weight increases with its frequency in the document and decreases with the number of documents that contain it, so words that appear almost everywhere contribute little to finding a pattern.

In [None]:
### Pipeline

def define_pipeline(model):
    """
    Function that returns the pipeline for the model that has been selected.
    Params:
    - model: object. The model whose pipeline will be returned.
    Output:
    - pipeline: object. The pipeline object to do the transformations according to the model.
    """
    
    pipeline = Pipeline([
                ('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('model', model),
                ])
    return pipeline

- Link for *Stochastic Gradient Descent*: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
- Link for *Naive Bayes*: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
- Link for *K-Nearest Neighbours*: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
- Tips:                          https://medium.datadriveninvestor.com/k-nearest-neighbors-in-python-hyperparameters-tuning-716734bc557f
- Link for *Support Vector Machine*: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
- Tips:                            https://medium.com/analytics-vidhya/hyperparameter-tuning-an-svm-a-demonstration-using-hyperparameter-tuning-cross-validation-on-96b05db54e5b
- Link for *Random Forest*: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- Link for *Decision Tree*: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
- Tips:                   https://www.kaggle.com/code/gauravduttakiit/hyperparameter-tuning-in-decision-trees/notebook
- Link for *Logistic Regression*: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- Tips:                         https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/


Although all of the aforementioned models are defined for this project, only NB and SGD are used, to keep the analysis simple. For every model, the hyperparameters in the grid were chosen from those most frequently tuned in data science practice.

Moreover, two different ways of building the ngram ranges were defined.

`ngram_range` gives the lower and upper boundary of the range of n-values for the word or character n-grams to be extracted: all values of n such that min_n <= n <= max_n are used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. We therefore decided to explore the different possibilities both incrementally, as in (1, 1), (1, 2), (1, 3), and individually, as in (1, 1), (2, 2), (3, 3).

The analyzer for our models was to be based on words rather than characters, otherwise it would defeat the purpose of the project.

In [None]:
def define_param_grid(model: object, ngram: int, method=1):
    """
    Function that returns the parameter grid based on the model that was defined in the pipeline and the n-gram selected.
    Params:
    - model: object. The model that you want to fine tune with hyperparameters.
    - ngram: int. The range of ngrams you want to fine tune the model with.
    - method: int. 1 returns the incremented ngram range, 2 returns the individual ngrams.
    Output:
    - parameters: dict. The parameters defined in the function for the specific model chosen.
    IMP: More parameters can be added for each model to add complexity BUT will take much longer!
    """

    ### Choose the ngram method to be taken
    
    # If method is to be INCREMENTAL ngrams: (1, 1), (1, 2), ..., (1, ngram)
    if method == 1:
        vect__ngram_range = [(1, i + 1) for i in range(ngram)]

    # If method is to be INDIVIDUAL ngrams: (1, 1), (2, 2), ..., (ngram, ngram)
    elif method == 2:
        vect__ngram_range = [(i + 1, i + 1) for i in range(ngram)]

    # Any other value would otherwise leave vect__ngram_range undefined
    else:
        raise ValueError("method must be 1 (incremental) or 2 (individual)")
    
    ### Get the model chosen from the pipe
    model_chosen = define_pipeline(model)['model']  

    ### Obtain parameter grid according to the model chosen
    
    # Naive Bayes and its Multiclass
    if model_chosen == mr_naivebayes or model_chosen == mr_naivebayes_multiclass:
        parameters = {
    'vect__ngram_range': vect__ngram_range,
    'model__alpha': [10 ** -x for x in range(1, 10)],
    }

    # Stochastic Gradient Descent and its Multiclass    
    elif model_chosen == sgd or model_chosen == sgd_multiclass:
        parameters = {
    'vect__ngram_range': vect__ngram_range,
    'model__alpha': [10 ** -x for x in range(1, 10)],
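    # 'log' is the deprecated alias of 'log_loss' and is rejected by recent scikit-learn releases, so it is omitted
    'model__loss': ['hinge', 'log_loss', 'modified_huber', 
                    'squared_hinge', 'perceptron', 'squared_error', 
                    'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],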
    'model__penalty': ['l2', 'l1', 'elasticnet'],
    'model__fit_intercept': [True, False],
    }

    # K-Nearest Neighbours    
    elif model_chosen == knn:
        parameters = {
    'vect__ngram_range': vect__ngram_range,
    'model__leaf_size': list(range(1, 50)),
    'model__n_neighbors': list(range(1, 30)),
    'model__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'model__weights': ['uniform', 'distance'],
    'model__p': [1, 2],
    }

    # Support Vector Machine
    elif model_chosen == svm:
        parameters = {
    'vect__ngram_range': vect__ngram_range,
    'model__gamma': [0.1, 1.0, 10, 100, 1000],
    'model__C': [0.1, 1.0, 10, 100, 1000],
    # 'precomputed' is left out: it expects a square kernel matrix as input, not TF-IDF features
    'model__kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 
    }

    # Random Forest
    elif model_chosen == rf:
        parameters = {
    'vect__ngram_range': vect__ngram_range,        
    'model__n_estimators': [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
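    # 'auto' was an alias of 'sqrt' for RandomForestClassifier and is rejected by recent scikit-learn releases
    'model__max_features': ['sqrt', 'log2'],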
    'model__max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__bootstrap': [True, False],
    }
        
    # Decision Tree    
    elif model_chosen == dt:
        parameters = {
    'vect__ngram_range': vect__ngram_range,        
    'model__max_depth': [2, 3, 5, 10, 20],
    'model__min_samples_leaf': [5, 10, 20, 50, 100],
    'model__criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'model__splitter': ['best', 'random'],
    # 'auto' is rejected by recent scikit-learn releases; None (use all features) replaces it
    'model__max_features': [None, 'sqrt', 'log2']
    }

    # Logistic Regression
    elif model_chosen == lr:
        parameters = {
    'vect__ngram_range': vect__ngram_range, 
    'model__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'model__penalty':  ['none', 'l1', 'l2', 'elasticnet'],
    'model__C': [0.01, 0.1, 1.0, 10, 100],        
    }
    return parameters

Next, a function is defined that searches for the best parameters of both the feature extraction and the classifier using 5-fold cross-validation.

By utilising cross-validation and grid search, we were able to achieve a more meaningful result than with our initial train/test split and minimal tuning. Cross-validation gives a more reliable estimate of performance because every part of the training set is used for both training and validation across the folds.
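The idea can be sketched on a toy problem: every combination in the parameter grid is scored by cross-validation, and the best-scoring combination is kept. The texts, labels, grid values and the use of 3 folds (chosen because the toy set is tiny) are assumptions made for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = sports, 0 = other
texts = [
    "team wins the final match", "striker scores late goal",
    "coach praises young team", "fans cheer after the match",
    "club signs new striker", "goal keeper saves penalty",
    "stocks fall on weak earnings", "bank raises interest rates",
    "markets rally after earnings", "oil prices hit record high",
    "company reports record profit", "investors sell bank shares",
]
labels = [1] * 6 + [0] * 6

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", MultinomialNB()),
])

# 2 ngram ranges x 2 alphas = 4 candidate settings, each scored on 3 folds
toy_grid = GridSearchCV(
    toy_pipeline,
    param_grid={"vect__ngram_range": [(1, 1), (1, 2)], "model__alpha": [0.1, 1.0]},
    cv=3,
    scoring="accuracy",
)
toy_grid.fit(texts, labels)

print(toy_grid.best_params_)               # best combination found
print(len(toy_grid.cv_results_["params"]))  # number of combinations tried
```
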

In [None]:
def define_gridsearch(model: object, param_grid: dict, scorer = "accuracy"):
    """
    Function that returns the (unfitted) grid search object for the pipeline built around the chosen model.
    Params:
    - model: object. The model to be fine tuned with hyperparameter selections.
    - param_grid: dict. The combination of parameters to be run across the model chosen.
    - scorer: str. The metric on which the model is to be evaluated. Default = accuracy.
    Output:
    - grid: object. The GridSearchCV object (5-fold CV); once fit, it exposes the cross-validated scores, the best estimator and the best parameters.
    """
    pipe = define_pipeline(model)
    grid = GridSearchCV(pipe, param_grid, cv = 5, scoring = scorer, return_train_score = True)
    return grid

## 4. **Convenience Functions, Learning Curves and Report Generators**

Here, we define convenience functions that build a dataframe of the cross-validation results, sorted so that the best-ranked estimators appear at the top. In addition, the feature weights are sorted as an additional diagnostic, in order to understand which words and ngrams most influence the assignment of a text to a topic.

Then, we use a function that calls the other functions in order to obtain the best estimator's results based on the CV and weights.

Finally, a learning curve of the fitted best estimator shows how its accuracy progresses as the number of training samples increases.
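For comparison, scikit-learn's own `learning_curve` helper (imported at the top of this notebook) computes a cross-validated learning curve in a single call. The sketch below is an alternative technique shown on hypothetical toy data; it is not the approach taken by the functions in this section.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = sports, 0 = other
texts = [
    "team wins the final match", "striker scores late goal",
    "coach praises young team", "fans cheer after the match",
    "club signs new striker", "goal keeper saves penalty",
    "stocks fall on weak earnings", "bank raises interest rates",
    "markets rally after earnings", "oil prices hit record high",
    "company reports record profit", "investors sell bank shares",
]
labels = [1] * 6 + [0] * 6

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", MultinomialNB()),
])

# With 12 samples and 3 folds, each training split has 8 samples;
# the curve is evaluated at 50% and 100% of that size
train_sizes, train_scores, test_scores = learning_curve(
    toy_pipeline, texts, labels,
    train_sizes=[0.5, 1.0], cv=3, scoring="accuracy",
    shuffle=True, random_state=42,
)

# One row per training size, one column per fold
for size, fold_scores in zip(train_sizes, test_scores):
    print(size, fold_scores.mean())
```
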

In [4]:
def crossvalidation_report_df(grid_cv):
    """
    Convenience function that creates a simple dataframe that reports the results of a cross-validation experiment. 
    Params:
    - grid_cv: object. Input cross-validated grid that must already be fit.
    Output:
    - dataframe. The scores sorted by rank of experiment.
    """
    # Pick columns that define each experiment (start with param) and the columns that report mean_test and rank_test results
    cols = [c for c in grid_cv.cv_results_ if (c.startswith('param') or c in ['mean_test_score', 'rank_test_score'])]

    # Sort original df by rank, and select columns
    return pd.DataFrame(grid_cv.cv_results_).sort_values(by='rank_test_score')[cols]

def sort_feature_weights(grid, fkey='vect', wkey='model'):
    """ 
    Convenience function that gets the weights of each words/ngram and orders them from lowest to highest. 
    Highest being words most associated with the given topic.
    Output:
    - list. Returns a list of tuples with word/ngram and its weight.
    """
    
    # Obtain the features of the best estimator post fit and the corresponding weights assigned to them
    F = grid.best_estimator_[fkey].get_feature_names_out()
    try:
        W = grid.best_estimator_[wkey].coef_[0]
    except AttributeError: #for the mr_naivebayes
        W = grid.best_estimator_[wkey].feature_log_prob_[1] 
    
    # Sort the values based on weights
    return sorted(zip(F, W), key=lambda fw: fw[1]) 

def sort_feature_multiclassweights(grid, fkey='vect', wkey='model'):
    """ 
    Convenience function that gets the weights of each words/ngram and orders them from lowest to highest. 
    Highest being words most associated with the given topic.
    Output:
    - list. Returns a list of tuples with word/ngram and its weight.
    """
    
    # Obtain the features of the best estimator post fit and the corresponding weights per topic
    F = grid.best_estimator_[fkey].get_feature_names_out()
    try:
        science = grid.best_estimator_[wkey].coef_[0]
        sports = grid.best_estimator_[wkey].coef_[1]
        world = grid.best_estimator_[wkey].coef_[2]
        business = grid.best_estimator_[wkey].coef_[3]
    except AttributeError: #for the mr_naivebayes
        science = grid.best_estimator_[wkey].feature_log_prob_[0] 
        sports = grid.best_estimator_[wkey].feature_log_prob_[1]
        world = grid.best_estimator_[wkey].feature_log_prob_[2]
        business = grid.best_estimator_[wkey].feature_log_prob_[3]
    
    # Sort the values based on weights
    return sorted(zip(F, science, sports, world, business), key=lambda fw: fw[1]) 


def apply_modelling(grid, train_data, test_data):
    """
    Convenience function that creates the model for all news types and returns all the info we need to analyze the model
    Output:
    - cv_report: df report of the CV with the params tried and their scores in the CV.
    - score_report: report of the best_estimator's performance on the test data.
    - bestestimator_parameters: parameters used in the best estimator.
    - bestestimator_weights: word weights for the best estimator.
    """ 
    
    # Get the cross-validation report, the accuracy score report, the best estimator's parameters & weights and the grid
    
    targets = ["science_int", "sports_int", "world_int", "business_int"]
    report_dict = {
        "targets": targets, 
        "cv_report":[], 
        "score_report":[],
        "bestestimator_parameters":[],
        "grid":[], 
        "bestestimator_weights":[]
    }

    for target in targets:
        grid.fit(train_data.text, train_data.loc[:,target])
        cv_report = crossvalidation_report_df(grid)
        score_report = classification_report(test_data.loc[:,target], 
            grid.best_estimator_.predict(test_data.text),target_names = ["others", target[:-4]])
        bestestimator_parameters = grid.best_params_
        bestestimator_weights = sort_feature_weights(grid) 
        report_dict["cv_report"].append(cv_report)
        report_dict["score_report"].append(score_report)
        report_dict["bestestimator_parameters"].append(bestestimator_parameters)
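        # Store the fitted best pipeline for this target: `grid` is refit on every iteration,
        # so appending `grid` itself would leave every entry pointing at the last target's fit
        report_dict["grid"].append(grid.best_estimator_)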
        report_dict["bestestimator_weights"].append(bestestimator_weights)
    return report_dict


def apply_multiclassmodelling(grid, train_data, test_data):
    """
    Convenience function that creates the model for all news types and returns all the info we need to analyze the model
    Output:
    - cv_report: df report of the CV with the params tried and their scores in the CV.
    - score_report: report of the best_estimator's performance on the test data.
    - bestestimator_parameters: parameters used in the best estimator.
    - bestestimator_weights: word weights for the best estimator.
    """ 
    
    targets = ["label_int"]
    report_dict = {
        "targets": targets, 
        "cv_report":[], 
        "score_report":[],
        "bestestimator_parameters":[],
        "grid":[], 
        "bestestimator_weights":[]
    }

    for target in targets:
        grid.fit(train_data.text, train_data.loc[:,target])
        cv_report = crossvalidation_report_df(grid)
        score_report = classification_report(
            test_data.loc[:,target], 
            grid.best_estimator_.predict(test_data.text),
            target_names = ["Science", "Sports", "World", "Business"])
        bestestimator_parameters = grid.best_params_
        bestestimator_weights = sort_feature_multiclassweights(grid) #order of weights is like in the table (science first,...)
        report_dict["cv_report"].append(cv_report)
        report_dict["score_report"].append(score_report)
        report_dict["bestestimator_parameters"].append(bestestimator_parameters)
        report_dict["grid"].append(grid)
        report_dict["bestestimator_weights"].append(bestestimator_weights)
    return report_dict


def create_learningcurve(train_data, test_data, selected_estimator, loss, report_dict):
    """
    Function that creates the learning curve for the selected estimator.
    Output:
    - dict. The report_dict with an added "training_curve" entry for the learning curve.
    """
    from sklearn.base import clone  # local import so that this function is self-contained

    targets = ["science_int", "sports_int", "world_int", "business_int"]
    report_dict["training_curve"] = []
    for target in targets:
        
        # Takes a scoring function and a model and finds the performance for a given amount of training data
        training_curve = np.array([[0,0]])
        n_samples = 10
        while n_samples < train_data.shape[0]:
            print(n_samples)
            # Fit a fresh clone on the first n_samples rows so the estimator passed in is left untouched;
            # .iloc is positional, whereas .loc[:n_samples] would also include the row labelled n_samples
            new_model = clone(selected_estimator).fit(
                train_data["text"].iloc[:n_samples], train_data[target].iloc[:n_samples]
                )
            y_pred = new_model.predict(test_data.text)
            score = loss(test_data.loc[:,target], y_pred)
            training_curve = np.append(training_curve,[[n_samples, score]], axis = 0)
            n_samples *= 2
        report_dict["training_curve"].append(training_curve)
    return report_dict

def create_multiclasslearningcurve(train_data, test_data, selected_estimator, loss, report_dict):
    """
    Function that creates the learning curve for the selected estimator.
    Output:
    - dict. The report_dict with an added "training_curve" entry for the learning curve.
    """
    from sklearn.base import clone  # local import so that this function is self-contained

    targets = ["label_int"]
    report_dict["training_curve"] = []
    for target in targets:
        
        # Takes a scoring function and a model and finds the performance for a given amount of training data
        training_curve = np.array([[0,0]])
        n_samples = 10
        while n_samples < train_data.shape[0]:
            print(n_samples)
            # Fit a fresh clone on the first n_samples rows so the estimator passed in is left untouched;
            # .iloc is positional, whereas .loc[:n_samples] would also include the row labelled n_samples
            new_model = clone(selected_estimator).fit(
                train_data["text"].iloc[:n_samples], train_data[target].iloc[:n_samples]
                )
            y_pred = new_model.predict(test_data.text)
            score = loss(test_data.loc[:,target], y_pred)
            training_curve = np.append(training_curve,[[n_samples, score]], axis = 0)
            n_samples *= 2
        report_dict["training_curve"].append(training_curve)
    return report_dict


##Analysis functions that can be used to dig into the models##

def show_cv_report(report_dict):
    """
    Convenience function to print results to analyze them.
    Params:
    - report_dict: dict. The report dictionary we would like to analyze.
    """
    
    targets = ["science_int", "sports_int", "world_int", "business_int"]
    for index, value in enumerate(targets):
        print("\n CV report for {}\n".format(value[:-4]))
        print(report_dict["cv_report"][index])

def show_classification_report(report_dict):
    ''' Convenience function to print results to analyze them
    report_dict: report dictionary we would like to analyze'''
    targets = ["science_int", "sports_int", "world_int", "business_int"]
    for index, value in enumerate(targets):
        print("\n Classification report of best estimator for {}\n".format(value[:-4]))
        print(report_dict["score_report"][index])

def show_best_estimator(report_dict):
    ''' Convenience function to print results to analyze them
    report_dict: report dictionary we would like to analyze'''
    targets = ["science_int", "sports_int", "world_int", "business_int"]
    for index, value in enumerate(targets):
        print("\n Parameters of best estimator for {}\n".format(value[:-4]))
        print(report_dict["bestestimator_parameters"][index])    

def show_word_weights(report_dict):
    ''' Convenience function to print results to analyze them
    report_dict: report dictionary we would like to analyze'''
    targets = ["science_int", "sports_int", "world_int", "business_int"]
    for index, value in enumerate(targets):
        print("\n Top words for {} \n".format(value[:-4]))
        print(report_dict["bestestimator_weights"][index][-20:])
        print("\n Worst words for {}\n".format(value[:-4]))
        print(report_dict["bestestimator_weights"][index][:20])      

def show_multiclassword_weights(multiclass_report_dict):
    ''' Convenience function to print results to analyze them
    multiclass_report_dict: report dictionary we would like to analyze'''
    targets = ["science_int", "sports_int", "world_int", "business_int"]
    for index, value in enumerate(targets):
        sorted_words = sorted(multiclass_report_dict["bestestimator_weights"][0], key=lambda fw: fw[index + 1])
        print("\n Top words for {}\n".format(value[:-4]))
        print(sorted_words[-20:])
        print("\n Worst words for {}\n".format(value[:-4]))
        print(sorted_words[:20])      


def show_learning_curve(report_dict):
    ''' Convenience function to print results to analyze them
    report_dict: report dictionary we would like to analyze'''
    targets = ["science_int", "sports_int", "world_int", "business_int"]
    for index, value in enumerate(targets):
        print("\n Learning curve for {}\n".format(value[:-4]))
        x = report_dict["training_curve"][index][:,0]
        y = report_dict["training_curve"][index][:,1]
        plt.plot(x,y, label = value[:-4])
    plt.ylabel("Accuracy")
    plt.xlabel("# of samples in training")
    plt.legend()
    plt.show()

def show_multiclasslearning_curve(multiclass_report_dict):
    ''' Convenience function to print results to analyze them
    multiclass_report_dict: report dictionary we would like to analyze'''
    targets = ["label_int"]
    for index, value in enumerate(targets):
        print("\n Learning curve for {}\n".format(value))
        x = multiclass_report_dict["training_curve"][index][:,0]
        y = multiclass_report_dict["training_curve"][index][:,1]
        plt.plot(x,y, label = "Overall")
    plt.ylabel("Accuracy")
    plt.xlabel("# of samples in training")
    plt.legend()
    plt.show()


def show_precision_recall_curve(report_dicts, test_data):
    '''Convenience function that prints all precision recall curves of the models to be analyzed
    report_dicts: list of report dictionaries for which we would like to analyze the best models
    test_data: df with the test data provided
    returns a plot with the charts'''

    targets = ["science_int", "sports_int", "world_int", "business_int"]

    #Create a chart for each prediction target
    for target_index, target in enumerate(targets):

        plt.figure(figsize=(10,10))
        plt.title('Precision-Recall curve for {}'.format(target))
        # precision_recall_curve returns (precision, recall, thresholds);
        # recall is plotted on the x-axis and precision on the y-axis
        plt.xlabel('recall')
        plt.ylabel('precision')


        colors = ["green", "red", "orange", "black", "blue", "pink"]
        markers = ['o', 'v', '^', '<', '>', '8']
        #Create the scatter for each model
        for model_index, report_dict in enumerate(report_dicts):
            # Use the entry stored for this target rather than always the first one
            model = report_dict["grid"][target_index]
            # Estimators such as MultinomialNB have no decision_function;
            # fall back to the predicted probability of the positive class
            if hasattr(model, "decision_function"):
                scores = model.decision_function(test_data.text)
            else:
                scores = model.predict_proba(test_data.text)[:, 1]
            pr = precision_recall_curve(test_data.loc[:,target], scores, pos_label=1)
            plt.scatter(y=pr[0], x=pr[1], label='Model {}'.format(model_index), alpha = 0.5, linewidths = 0.5, color = colors[model_index], marker = markers[model_index])
          
        plt.grid(True)
        plt.legend()
        plt.show()


def show_multiclassprecision_recall_curve(multiclass_report_dicts, test_data):
    '''Convenience function that prints all precision recall curves of the models to be analyzed
    multiclass_report_dicts: list of report dictionaries for which we would like to analyze the best models
    test_data: df with the test data provided
    returns a plot with the charts'''

    #create a list of the models to analyze
    models = []
    for report_dict in multiclass_report_dicts:
        models.append(report_dict["grid"][0])

    targets = ["science_int", "sports_int", "world_int", "business_int"]
    fig, axes = plt.subplots(2,4)

    #Create a chart for each prediction target
    for target_index, target in enumerate(targets):

        axes[1, target_index].set_title('Precision-Recall curve for {}'.format(target))
        # The scatter below puts recall (pr[1]) on the x-axis and precision (pr[0]) on the y-axis
        axes[1, target_index].set_xlabel('recall')
        axes[1, target_index].set_ylabel('precision')
        axes[1, target_index].grid(True)


        colors = ["green", "red", "orange", "black", "blue", "pink"]
        markers = ['o', 'v', '^', '<', '>', '8']
        #Create the scatter for each model
        for model_index, model in enumerate(models):
            pr = precision_recall_curve(
                test_data.loc[:,target],
                model.decision_function(test_data.text)[:,target_index],
                pos_label=1)
            axes[1, target_index].scatter(y=pr[0], x=pr[1], label='Model {}'.format(model_index), alpha = 0.5, linewidths = 0.5, color = colors[model_index], marker = markers[model_index])
        
    fig.legend()
    plt.show()

## 5. **Base Models (only Unigram)**

### a. *Models for each Target*: **Stochastic Gradient Descent & Multinomial Naive Bayes**

In [None]:
if __name__ == "__main__":

    #Load data
    train = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_train.csv")
    test = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_test.csv")

    ####### Models for each target ######

    #Define model (Base)
    param_grid = {
        'vect__ngram_range': [(1,1)],
        'model__alpha': [(1e-9)]
        }    

    #create grids for basic sgd and nb
    grid_sgd = define_gridsearch(model = sgd, param_grid = param_grid, scorer = "accuracy")
    grid_nb = define_gridsearch(model = mr_naivebayes, param_grid = param_grid, scorer = "accuracy")
    
    #Create reporting data for sgd and nb
    report_dict_sgd = apply_modelling(grid_sgd, train, test)
    report_dict_nb = apply_modelling(grid_nb, train, test)
    report_dict_sgd = create_learningcurve(train, test, grid_sgd.best_estimator_, accuracy_score, report_dict_sgd) #can choose any estimator; don't have to choose best_estimator here
    report_dict_nb = create_learningcurve(train, test, grid_nb.best_estimator_, accuracy_score, report_dict_nb) #can choose any estimator; don't have to choose best_estimator here


    #Saving the results of the testing into a CSV file
    df = pd.DataFrame(report_dict_sgd) 
    df.to_csv (r"./06_Session 6 - NLP/Advanced_AI_NLP/base_test_series_data.csv",index = False, header=True)

    ##Analysis##

    #looking at the performance of the different hyperparameters in the gridsearch
    print("CV report for each target")
    show_cv_report(report_dict_sgd)

    #Looking at the classification report of the best estimator
    print("Classification reports of best estimators")
    show_classification_report(report_dict_sgd)

    #Looking at the parameters of the best estimator
    print("Parameters of best estimators")
    show_best_estimator(report_dict_sgd)

    #Look at word weights
    print("Word weights of best estimators for each target")
    show_word_weights(report_dict_sgd)

    #looking at the learning curve
    print("Learning curve of models")
    show_learning_curve(report_dict_sgd)

    #Comparing the precision recall curve of different models
    report_dicts = [report_dict_sgd, report_dict_nb]
    show_precision_recall_curve(report_dicts, test)

### b. *Multiclass*: **Stochastic Gradient Descent**

In [None]:
if __name__ == "__main__":

    #Load data
    train = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_train.csv")
    test = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_test.csv") 
        
    ####### Multiclass models ######

    #Define model
    param_grid = {
        'vect__ngram_range': [(1,1)],
        'model__alpha': [(1e-9)]
        }    
    multiclass_grid = define_gridsearch(model = sgd_multiclass, param_grid = param_grid, scorer = "accuracy") #Not sure whether multiclass would work with the bayes

    #Create reporting data
    multiclass_report_dict = apply_multiclassmodelling(multiclass_grid, train, test)
    multiclass_report_dict = create_multiclasslearningcurve(train, test, multiclass_grid.best_estimator_, accuracy_score, multiclass_report_dict) #can choose any estimator; don't have to choose best_estimator here

    #Saving the results of the testing into a CSV file
    multiclass_df = pd.DataFrame(multiclass_report_dict) 
    multiclass_df.to_csv (
        r"./06_Session 6 - NLP/Advanced_AI_NLP/multiclass_base_test_series_data.csv",
         index = False, header=True
         )

    ##Analysis##

    #looking at the performance of the different hyperparameters in the gridsearch
    print("\n Gridsearch CV report \n")
    print(multiclass_report_dict["cv_report"][0])      

    #Looking at the classification report of the best estimator
    print("\n Classification report of best estimator \n")
    print(multiclass_report_dict["score_report"][0])

    #Looking at the parameters of the best estimator
    print("\n Parameters of best estimator \n")
    print(multiclass_report_dict["bestestimator_parameters"][0])

    #Look at word weights
    print("\n Word weights for each target \n")
    show_multiclassword_weights(multiclass_report_dict)

    #looking at the learning curve
    print("\n Learning curve of models \n")
    show_multiclasslearning_curve(multiclass_report_dict)

    #Comparing the precision recall curve of different models
    multiclass_report_dicts = [multiclass_report_dict]
    show_multiclassprecision_recall_curve(multiclass_report_dicts, test)

## 6. *Models for each target*: **Hyperparameter Tuned Models (up to Fivegram)**

In this section, we define models that search for the best parameters and ngram range for each individual target: science, sports, world, and business. Because each target gets its own model, a specific ngram range can be selected for one topic without being imposed on the others. This should allow for a higher average precision than the multiclass model we will define later.

We chose fivegram as our maximum range because, for text classification purposes, a fourgram is typically sufficient to reach the highest precision, so a fivegram leaves a margin above that. The useful range also depends on the length of each topic's sentences: a higher ngram range could capture almost the entire sentence, which no longer allows for a nuanced understanding of the sentence's constituent parts.
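The effect of the ngram order on a short sentence can be seen directly with `CountVectorizer`. The sketch below uses a made-up seven-word headline (an assumption, not a sample from the dataset) and counts the distinct ngrams of each order from one to five.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up seven-word headline (not taken from the dataset)
headline = ["stocks rally as oil prices fall sharply"]

ngram_counts = {}
for n in range(1, 6):
    # ngram_range=(n, n) extracts ngrams of exactly n words
    vect = CountVectorizer(ngram_range=(n, n)).fit(headline)
    ngram_counts[n] = len(vect.vocabulary_)
    print(n, sorted(vect.vocabulary_))
```

For this headline the number of distinct features drops from seven unigrams to three fivegrams, and each fivegram already covers most of the sentence.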

In the models described below, the optimal ngram range was almost always unigram only or up to bigram, with the rare exception of up to trigram. The F1 score tends to decrease beyond that point, which indicates that fivegram was an adequate upper bound.
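A trend like this can be checked by listing the mean cross-validated score for each candidate ngram range. The following is a minimal sketch on toy data, assuming a pipeline step named `vect`; the step names, the toy texts, and `cv=2` are illustrative choices and not the notebook's `define_gridsearch` settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy binary task: 1 = sports headline, 0 = anything else (hypothetical data)
texts = [
    "striker scores twice in cup final",
    "tennis champion wins fifth title",
    "team wins cup after extra time",
    "coach praises striker after final",
    "stocks rally as oil prices fall",
    "bank reports record quarterly profit",
    "leaders meet for peace talks",
    "nasa probe finds water on mars",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Candidate ranges: unigram only, up to bigram, up to trigram
param_grid = {"vect__ngram_range": [(1, 1), (1, 2), (1, 3)]}
grid = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=2)
grid.fit(texts, y)

# One mean score per candidate range, then the winning setting
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    print(params["vect__ngram_range"], round(score, 3))
print("best:", grid.best_params_)
```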

The slightly superior performance of the Multinomial Naive Bayes model may be partly attributable to the dataset being perfectly balanced; with imbalanced classes, the model could have returned different results.
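One reason class balance matters for Multinomial Naive Bayes is that, with the default `fit_prior=True`, the model learns its class priors from the label frequencies. The sketch below fits the same made-up count matrix with balanced and imbalanced labels and compares the learned priors; all numbers are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny word-count matrix: 8 documents, 3 vocabulary terms (made-up numbers)
X = np.array([
    [2, 0, 1], [1, 0, 0], [3, 1, 0], [2, 0, 0],
    [0, 2, 1], [0, 1, 2], [0, 3, 0], [1, 2, 1],
])

y_balanced = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # 4 vs 4
y_imbalanced = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # 6 vs 2

# class_log_prior_ holds the log of the empirical class frequencies
prior_balanced = np.exp(MultinomialNB().fit(X, y_balanced).class_log_prior_)
prior_imbalanced = np.exp(MultinomialNB().fit(X, y_imbalanced).class_log_prior_)

print("balanced priors:  ", prior_balanced)
print("imbalanced priors:", prior_imbalanced)
```

With balanced labels both priors are equal, so predictions are driven by the word likelihoods alone; with imbalanced labels the majority class receives a head start.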

### a. **Multinomial Naive Bayes**
> i. **Incremental (from unigram only up to range of unigram to fivegram)** <br/>
> ii. **Individual (from unigram only up to fivegram only)**
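
The difference between the two search methods comes down to which `ngram_range` tuples are tried. The helper below is a hypothetical sketch of that idea, following the `method = 1` (incremental) and `method = 2` (individual) convention used in the cells; it is not the notebook's `define_param_grid`.

```python
def ngram_ranges(max_n, method):
    """Return candidate (min_n, max_n) tuples for CountVectorizer's ngram_range."""
    if method == 1:
        # Incremental: always start at unigrams and widen the upper bound
        return [(1, n) for n in range(1, max_n + 1)]
    if method == 2:
        # Individual: use ngrams of exactly one order at a time
        return [(n, n) for n in range(1, max_n + 1)]
    raise ValueError("method must be 1 (incremental) or 2 (individual)")


print(ngram_ranges(5, method=1))
# [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5)]
print(ngram_ranges(5, method=2))
# [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```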

In [None]:
if __name__ == "__main__":

    #Load data
    train = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_train.csv")
    test = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_test.csv") 

    ####### Hyperparameter tuning and ngram models ######
    
    ### i

    #Define Multinomial Naive Bayes model using incremental method up to fivegram
    param_grid = define_param_grid(model = mr_naivebayes, ngram = 5, method = 1)
    grid = define_gridsearch(model = mr_naivebayes, param_grid = param_grid, scorer = "accuracy")
    
    #Create reporting data
    report_dict = apply_modelling(grid, train, test)
    report_dict_nb_1 = create_learningcurve(train, test, grid.best_estimator_, accuracy_score, report_dict) #can choose any estimator; don't have to choose best_estimator here

    #Saving the results of the testing into a CSV file
    df = pd.DataFrame(report_dict_nb_1)
    df.to_csv(r"./06_Session 6 - NLP/Advanced_AI_NLP/NB_tuned_test_series_data.csv", index=False, header=True)

    ##Analysis##

    #looking at the performance of the different hyperparameters in the gridsearch
    print("CV report for each target")
    show_cv_report(report_dict_nb_1)

    #Looking at the classification report of the best estimator
    print("Classification reports of best estimators")
    show_classification_report(report_dict_nb_1)

    #Looking at the parameters of the best estimator
    print("Parameters of best estimators")
    show_best_estimator(report_dict_nb_1)

    #Look at word weights
    print("Word weights of best estimators for each target")
    show_word_weights(report_dict_nb_1)

    #Looking at the learning curve
    print("Learning curve of models")
    show_learning_curve(report_dict_nb_1)
    
    ### ii

    #Define Multinomial Naive Bayes model using individual method up to fivegram
    param_grid = define_param_grid(model = mr_naivebayes, ngram = 5, method = 2)
    grid = define_gridsearch(model = mr_naivebayes, param_grid = param_grid, scorer = "accuracy")
    
    #Create reporting data
    report_dict = apply_modelling(grid, train, test)
    report_dict_nb_2 = create_learningcurve(train, test, grid.best_estimator_, accuracy_score, report_dict) #can choose any estimator; don't have to choose best_estimator here

    #Saving the results of the testing into a CSV file
    df = pd.DataFrame(report_dict_nb_2)
    df.to_csv(r"./06_Session 6 - NLP/Advanced_AI_NLP/NB_tuned_test_series_data_#2.csv", index=False, header=True)

    ##Analysis##

    #looking at the performance of the different hyperparameters in the gridsearch
    print("CV report for each target")
    show_cv_report(report_dict_nb_2)

    #Looking at the classification report of the best estimator
    print("Classification reports of best estimators")
    show_classification_report(report_dict_nb_2)

    #Looking at the parameters of the best estimator
    print("Parameters of best estimators")
    show_best_estimator(report_dict_nb_2)

    #Look at word weights
    print("Word weights of best estimators for each target")
    show_word_weights(report_dict_nb_2)

    #Looking at the learning curve
    print("Learning curve of models")
    show_learning_curve(report_dict_nb_2)

### b. **Stochastic Gradient Descent**
> i. **Incremental (from unigram only up to range of unigram to fivegram)** <br/>
> ii. **Individual (from unigram only up to fivegram only)**

In [None]:
if __name__ == "__main__":

    #Load data
    train = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_train.csv")
    test = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_test.csv")
        
    ### i

    #Define Stochastic Gradient Descent model using incremental method up to fivegram
    param_grid = define_param_grid(model = sgd, ngram = 5, method = 1)
    grid = define_gridsearch(model = sgd, param_grid = param_grid, scorer = "accuracy")
    
    #Create reporting data
    report_dict = apply_modelling(grid, train, test)
    report_dict_sgd_1 = create_learningcurve(train, test, grid.best_estimator_, accuracy_score, report_dict) #can choose any estimator; don't have to choose best_estimator here

    #Saving the results of the testing into a CSV file
    df = pd.DataFrame(report_dict_sgd_1)
    df.to_csv(r"./06_Session 6 - NLP/Advanced_AI_NLP/SGD_tuned_test_series_data_#2.csv", index=False, header=True)

    ##Analysis##

    #looking at the performance of the different hyperparameters in the gridsearch
    print("CV report for each target")
    show_cv_report(report_dict_sgd_1)

    #Looking at the classification report of the best estimator
    print("Classification reports of best estimators")
    show_classification_report(report_dict_sgd_1)

    #Looking at the parameters of the best estimator
    print("Parameters of best estimators")
    show_best_estimator(report_dict_sgd_1)

    #Look at word weights
    print("Word weights of best estimators for each target")
    show_word_weights(report_dict_sgd_1)

    #Looking at the learning curve
    print("Learning curve of models")
    show_learning_curve(report_dict_sgd_1)
    
    ### ii

    #Define Stochastic Gradient Descent model using individual method up to fivegram
    param_grid = define_param_grid(model = sgd, ngram = 5, method = 2)
    grid = define_gridsearch(model = sgd, param_grid = param_grid, scorer = "accuracy")
    
    #Create reporting data
    report_dict = apply_modelling(grid, train, test)
    report_dict_sgd_2 = create_learningcurve(train, test, grid.best_estimator_, accuracy_score, report_dict) #can choose any estimator; don't have to choose best_estimator here

    #Saving the results of the testing into a CSV file
    df = pd.DataFrame(report_dict_sgd_2)
    df.to_csv(r"./06_Session 6 - NLP/Advanced_AI_NLP/SGD_tuned_test_series_data.csv", index=False, header=True)

    ##Analysis##

    #looking at the performance of the different hyperparameters in the gridsearch
    print("CV report for each target")
    show_cv_report(report_dict_sgd_2)

    #Looking at the classification report of the best estimator
    print("Classification reports of best estimators")
    show_classification_report(report_dict_sgd_2)

    #Looking at the parameters of the best estimator
    print("Parameters of best estimators")
    show_best_estimator(report_dict_sgd_2)

    #Look at word weights
    print("Word weights of best estimators for each target")
    show_word_weights(report_dict_sgd_2)

    #Looking at the learning curve
    print("Learning curve of models")
    show_learning_curve(report_dict_sgd_2)

### c. Compare Precision-Recall Curve for Models in 6 (ai, aii, bi, bii)
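
For reference, comparing precision-recall behaviour across models amounts to computing one curve per model from its scores on the same ground truth. The sketch below uses made-up labels and scores for two hypothetical models; it shows only the underlying scikit-learn calls and is not the notebook's `show_precision_recall_curve`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up binary ground truth and decision scores from two hypothetical models
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores_a = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3])  # ranks every positive first
scores_b = np.array([0.9, 0.7, 0.4, 0.6, 0.8, 0.1, 0.5, 0.3])  # mixes the two classes

curves = {}
avg_precision = {}
for name, scores in {"model_a": scores_a, "model_b": scores_b}.items():
    # One (precision, recall) pair per decision threshold
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    curves[name] = (precision, recall)
    # Average precision summarises the curve as a single number
    avg_precision[name] = average_precision_score(y_true, scores)
    print(name, "average precision:", round(avg_precision[name], 3))
```

Plotting each `recall` array against its `precision` array on shared axes then gives the visual comparison.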

In [None]:
#Comparing the precision recall curve of different models
report_dicts = [report_dict_nb_1, report_dict_nb_2, report_dict_sgd_1, report_dict_sgd_2]
show_precision_recall_curve(report_dicts, test)

## 7. *Multiclass Models*: **Hyperparameter Tuned Models (up to Fivegram)**

In this section, we define multiclass classification models. Multiclass classification is a text classification task with more than two classes (targets): each data sample is assigned to exactly one class and cannot belong to more than one class simultaneously.
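A minimal sketch of this setup is shown below: a single pipeline is fitted on four-way labels and returns exactly one label per headline. The toy headlines and the pipeline step names are assumptions for illustration; the `SGDClassifier` settings mirror those of `sgd_multiclass` defined earlier in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy headlines with one topic label each (hypothetical data)
texts = [
    "nasa probe finds water on mars",
    "new chip doubles processor speed",
    "striker scores twice in cup final",
    "tennis champion wins fifth title",
    "leaders meet for peace talks",
    "election results spark protests",
    "stocks rally as oil prices fall",
    "bank reports record quarterly profit",
]
labels = ["science", "science", "sports", "sports",
          "world", "world", "business", "business"]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])
pipe.fit(texts, labels)

clf = pipe.named_steps["clf"]
predictions = pipe.predict(["oil prices fall", "cup final title"])

print(clf.classes_)     # the four classes the single model chooses between
print(clf.coef_.shape)  # one weight vector per class
print(predictions)      # exactly one label per input headline
```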

In this case, the model classifies news headlines into the corresponding categories (science, sports, world, and business) using only one combination of parameters.

As expected, the accuracy was lower than that of the single-class (per-target) models.
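One plausible reason for this gap, assuming the per-target models are binary one-vs-rest classifiers, is that the two tasks have very different trivial baselines. The sketch below uses a made-up, perfectly balanced label set of 100 samples to compare them.

```python
from sklearn.metrics import accuracy_score

# 100 made-up samples, perfectly balanced across the four topics
labels = ["science"] * 25 + ["sports"] * 25 + ["world"] * 25 + ["business"] * 25

# Binary "sports vs. rest" task: a model that always answers "not sports"
y_true_binary = [1 if label == "sports" else 0 for label in labels]
binary_baseline = accuracy_score(y_true_binary, [0] * len(labels))

# Four-way task: a model that always answers the same class
multiclass_baseline = accuracy_score(labels, ["sports"] * len(labels))

print(binary_baseline, multiclass_baseline)  # 0.75 0.25
```

A binary per-target model therefore starts from a much higher floor than a four-way model, so the accuracies of the two setups are not directly comparable.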

### a. **Multinomial Naive Bayes**
> **Incremental (from unigram only up to range of unigram to fivegram)**

In [None]:
if __name__ == "__main__":

    #Load data
    train = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_train.csv")
    test = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_test.csv") 

    ####### Hyperparameter tuning and ngram models ######

    #Define Multinomial Naive Bayes model using incremental method up to fivegram
    param_grid = define_param_grid(model = mr_naivebayes_multiclass, ngram = 5, method = 1)
    grid = define_gridsearch(model = mr_naivebayes_multiclass, param_grid = param_grid, scorer = "accuracy")
    
    #Create reporting data
    report_dict = apply_modelling(grid, train, test)
    report_dict_multi_nb = create_learningcurve(train, test, grid.best_estimator_, accuracy_score, report_dict) #can choose any estimator; don't have to choose best_estimator here

    #Saving the results of the testing into a CSV file
    df = pd.DataFrame(report_dict_multi_nb)
    df.to_csv(r"./06_Session 6 - NLP/Advanced_AI_NLP/NB_Multiiclass_tuned_test_series_data.csv", index=False, header=True)

    ##Analysis##

    #looking at the performance of the different hyperparameters in the gridsearch
    print("CV report for each target")
    show_cv_report(report_dict_multi_nb)

    #Looking at the classification report of the best estimator
    print("Classification reports of best estimators")
    show_classification_report(report_dict_multi_nb)

    #Looking at the parameters of the best estimator
    print("Parameters of best estimators")
    show_best_estimator(report_dict_multi_nb)

    #Look at word weights
    print("Word weights of best estimators for each target")
    show_word_weights(report_dict_multi_nb)

    #Looking at the learning curve
    print("Learning curve of models")
    show_learning_curve(report_dict_multi_nb)

### b. **Stochastic Gradient Descent**
> **Incremental (from unigram only up to range of unigram to fivegram)**

In [None]:
if __name__ == "__main__":

    #Load data
    train = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_train.csv")
    test = pd.read_csv("./06_Session 6 - NLP/Advanced_AI_NLP/agnews_test.csv")

    #Define Stochastic Gradient Descent model using incremental method up to fivegram
    param_grid = define_param_grid(model = sgd_multiclass, ngram = 5, method = 1)
    grid = define_gridsearch(model = sgd_multiclass, param_grid = param_grid, scorer = "accuracy")
    
    #Create reporting data
    report_dict = apply_modelling(grid, train, test)
    report_dict_multi_sgd = create_learningcurve(train, test, grid.best_estimator_, accuracy_score, report_dict) #can choose any estimator; don't have to choose best_estimator here

    #Saving the results of the testing into a CSV file
    df = pd.DataFrame(report_dict_multi_sgd)
    df.to_csv(r"./06_Session 6 - NLP/Advanced_AI_NLP/SGD_Multiclass_tuned_test_series_data.csv", index=False, header=True)

    ##Analysis##

    #looking at the performance of the different hyperparameters in the gridsearch
    print("CV report for each target")
    show_cv_report(report_dict_multi_sgd)

    #Looking at the classification report of the best estimator
    print("Classification reports of best estimators")
    show_classification_report(report_dict_multi_sgd)

    #Looking at the parameters of the best estimator
    print("Parameters of best estimators")
    show_best_estimator(report_dict_multi_sgd)

    #Look at word weights
    print("Word weights of best estimators for each target")
    show_word_weights(report_dict_multi_sgd)

    #Looking at the learning curve
    print("Learning curve of models")
    show_learning_curve(report_dict_multi_sgd)

### c. Compare Precision-Recall Curve for Models in 7 (a, b)

In [None]:
#Comparing the precision recall curve of different multiclass models
multiclass_report_dicts = [report_dict_multi_nb, report_dict_multi_sgd]
show_multiclassprecision_recall_curve(multiclass_report_dicts, test)