# CS 221 Project - Classifier
Using the pertinent text features to train a classifier: logistic regression, with the earlier Naive Bayes approach kept at the end for comparison.

## Libraries
Using NLTK for NLP-related parts.

In [10]:
import math
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import random
from sklearn.linear_model import LogisticRegression
from nltk.classify.scikitlearn import SklearnClassifier

## Loading in Data
Using the data from `last_data.csv`, which is generated from `features.ipynb`.

In [11]:
main_df = pd.read_csv('../data/last_data.csv')

## Generating Feature Vector
Using the extracted lexical features to produce a feature vector for each TED talk.

In [12]:
# List of features: the first three are numeric, the rest are words
features = ['length', 'diversity', 'virality', 'noun_1', 'adjective_1', 'noun_2', 'adjective_2', 'noun_3', 'adjective_3']

def new_ted_talk_features(data_point):
    # Initialize dictionary of features
    feature_dict = {}

    # Iterate through each of the features and set proper value
    for i, name in enumerate(features):
        if i < 3:
            # Numeric features: replace missing values (NaN) with 0
            value = float(data_point[name])
            feature_dict[name] = 0 if math.isnan(value) else value
        else:
            # Word features: replace missing values with an empty string
            value = str(data_point[name])
            if value == 'nan' or len(value) < 1:
                feature_dict[name] = ''
            else:
                feature_dict[name] = data_point[name]
    return feature_dict
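
The missing-value handling above can be tried without the CSV. The following is a minimal sketch on a plain dict, assuming a shortened feature list; `clean_features`, `example_row`, and its values are hypothetical and only illustrate the same rules (NaN numbers become `0`, missing words become `''`).

```python
import math

NUMERIC = ['length', 'diversity', 'virality']
CATEGORICAL = ['noun_1', 'adjective_1']  # shortened list, for illustration only

def clean_features(row):
    """Return a feature dict with NaN numbers set to 0.0 and missing words set to ''."""
    cleaned = {}
    for name in NUMERIC:
        value = float(row[name])
        cleaned[name] = 0.0 if math.isnan(value) else value
    for name in CATEGORICAL:
        value = row[name]
        # A missing string cell typically arrives as a float NaN when read from CSV
        is_missing = (
            value is None
            or (isinstance(value, float) and math.isnan(value))
            or value == ''
        )
        cleaned[name] = '' if is_missing else value
    return cleaned

# Hypothetical row with one missing number and one missing word
example_row = {
    'length': 12.5,
    'diversity': float('nan'),
    'virality': 0.3,
    'noun_1': 'science',
    'adjective_1': float('nan'),
}
example = clean_features(example_row)
print(example)
```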

## Creating Feature Sets
Using the function above to map each talk to a feature vector paired with its label.

In [14]:
# Build a list of (feature_dict, label) pairs, one per row of the data
feature_sets = []
for index, data_point in main_df.iterrows():
    feature_dict = new_ted_talk_features(data_point)
    feature_sets.append((feature_dict, data_point["percentile"]))

## Creating the Train and Test Sets
Shuffling the dataset, setting the training set to $80$ percent of the data, and setting aside the remaining $20$ percent as the test set.

In [15]:
random.shuffle(feature_sets)
train_len = int(len(feature_sets) * 0.8)
train_set, test_set = feature_sets[:train_len], feature_sets[train_len:]
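
Because `random.shuffle` is unseeded here, each run produces a different split and slightly different error figures. A minimal sketch of a reproducible variant follows; `split_train_test`, the seed value, and the toy data are assumptions for illustration, not part of the notebook.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items with a fixed seed, then cut it into train and test parts."""
    shuffled = list(items)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private generator, so global state is unaffected
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy (feature_dict, label) pairs standing in for feature_sets
toy_sets = [({'length': i}, i / 10) for i in range(10)]
train_part, test_part = split_train_test(toy_sets)
print(len(train_part), len(test_part))
```

Calling the function twice with the same seed returns the same split, which makes the reported errors comparable across runs.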

## Training the Classifier and Outputting Accuracy
Training the Logistic Regression algorithm.

In [16]:
classif = SklearnClassifier(LogisticRegression(max_iter=300))
classif.train(train_set)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<SklearnClassifier(LogisticRegression(max_iter=300))>
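
The warning above suggests either raising `max_iter` or scaling the data. As a rough sketch of the scaling option, the numeric features could be standardized before training. The helper below is hypothetical and uses only the standard library; in practice the means and standard deviations would be computed on the training set and reused for the test set.

```python
import statistics

def standardize(feature_dicts, numeric_keys):
    """Return copies of the dicts with each numeric key rescaled to zero mean, unit variance."""
    stats = {}
    for key in numeric_keys:
        values = [d[key] for d in feature_dicts]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        # Guard against constant columns, which have zero spread
        stats[key] = (mean, std if std > 0 else 1.0)

    scaled = []
    for d in feature_dicts:
        new_d = dict(d)  # non-numeric entries are carried over unchanged
        for key, (mean, std) in stats.items():
            new_d[key] = (d[key] - mean) / std
        scaled.append(new_d)
    return scaled, stats

# Toy feature dicts with hypothetical values
toy_features = [
    {'length': 10.0, 'virality': 1.0, 'noun_1': 'science'},
    {'length': 20.0, 'virality': 1.0, 'noun_1': 'art'},
    {'length': 30.0, 'virality': 1.0, 'noun_1': 'music'},
]
scaled_features, scaling_stats = standardize(toy_features, ['length', 'virality'])
print(scaled_features)
```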

## Calculating Error of Algorithm
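Calculating the mean squared error of the predictions on the test set in order to validate performance, alongside the mean squared distance of the predictions from a constant $0.5$.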

In [17]:
# Creating test set of data to pass into classifier
test_features = []
test_value = []
for i in range(len(test_set)):
    test_features.append(test_set[i][0])
    test_value.append(test_set[i][1])
final_preds = (classif.classify_many(test_features))

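# Calculating the error (note: `sum` shadows the Python built-in in this notebook)
sum = 0
baseSum = 0
ignore = 0
for i in range(len(test_value)):
    # Skip predictions that are NaN
    if math.isnan(final_preds[i]):
        ignore += 1
    else:
        # Squared error of the prediction against the true percentile
        sum += abs(final_preds[i] - test_value[i]) ** 2
        # Squared distance of the prediction from a constant 0.5
        baseSum += abs(final_preds[i] - .5) ** 2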

# Printing Final Error
print(sum / (len(test_value) - ignore))
print(baseSum / (len(test_value) - ignore))

0.19411657301048715
0.1105510790801164
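
For reference, a constant-prediction baseline is usually scored against the true labels rather than against the model's predictions. The sketch below shows that variant on toy numbers; `mean_squared_error`, `toy_preds`, and `toy_truth` are hypothetical, and the values are not taken from the dataset.

```python
import math

def mean_squared_error(predictions, targets):
    """Mean of squared differences, skipping any pair that contains a NaN."""
    pairs = [
        (p, t) for p, t in zip(predictions, targets)
        if not (math.isnan(p) or math.isnan(t))
    ]
    if not pairs:
        raise ValueError("no valid prediction/target pairs")
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)

# Toy predictions and true percentiles (hypothetical values)
toy_preds = [0.2, 0.8, float('nan'), 0.5]
toy_truth = [0.1, 0.9, 0.4, 0.7]

model_mse = mean_squared_error(toy_preds, toy_truth)
# Baseline: always predict 0.5, scored against the true labels
baseline_mse = mean_squared_error([0.5] * len(toy_truth), toy_truth)
print(model_mse, baseline_mse)
```

A model is only adding value if its error comes out below the baseline computed this way.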


## Old Approach: Naive Bayes Classifier
Code from our old approach, which is covered in "Error Analysis."

In [18]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.7965587044534413


## Miscellaneous
Other miscellaneous code.

In [165]:
print(sum)
print(final_preds[58])
print(ignore)

176.31493506493499
0.1607142857142857
2
