# Assignment2
## Text Classification

### Task:
The client has requested that we take content from websites and classify each item as one of:
* business
* politics
* tech
* entertainment
* sports

### Data:
News | Category
---- | --------
US cyber security chief resigns... | tech
Italy 8-38 Wales Wales secured their first away win in the RBS Six Nations... | sports
... | ...


## Steps:
### Preprocessing
The data is minimal: each row holds only the article content and its category, so no cleaning was needed beyond encoding the labels and vectorizing the text.

### Data Evaluation
Because only content and category are given, there is little to explore. The text was transformed with a count vectorizer, which produces a matrix with one row per document and one column per vocabulary word, where each cell holds that word's count in that document.
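
To make the word-count matrix concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences and variable names are illustrative assumptions, not rows from the BBC data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
toy_docs = [
    "wales beat italy",
    "italy lost to wales",
    "security chief resigns",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x words

# vocabulary_ maps each word to its column index (columns are in alphabetical order)
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab, key=vocab.get))
print(toy_counts.toarray())
```

Each row of the printed array is one document, and each column is the count of one vocabulary word in that document.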

### Model Selection
Tested multiple models:
* Logistic Regression
* K Nearest Neighbors
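* SVC with a linear kernel (named SVR in the code)
* SVC with an RBF kernel (named KSVR in the code)

Logistic Regression was the best option, with 98.38% accuracy on the test set.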

## Summary:
After vectorizing the data, Logistic Regression tested best, with 98.38% accuracy on the held-out 25% test split.

In [1]:
# Import tools for data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Reading in Data
df = pd.read_csv("./Assignment2_BBCNewsData.csv")
X = df.iloc[: , 1].values
y = df.iloc[:, -1].values

### Label Encoding

In [3]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
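
Label encoding assigns integers in sorted (alphabetical) order of the class names, which is also the row and column order of the confusion matrices printed later. A minimal sketch, assuming the five category strings are spelled as in the task list (the actual strings in the CSV may differ):

```python
from sklearn.preprocessing import LabelEncoder

# Category names assumed from the task description above
categories = ["business", "politics", "tech", "entertainment", "sports"]

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(categories)

# classes_ holds the sorted labels; position in this array is the integer code
print(list(demo_encoder.classes_))
print(list(encoded))

# inverse_transform maps integer predictions back to category names
print(list(demo_encoder.inverse_transform([0, 4])))
```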

### MLflow
MLflow is used to log each model and its score.

The tracking server was started with:
```powershell
mlflow server --backend-store-uri 'C:/temp/mlflow/localserver'
```

In [4]:
import mlflow.sklearn

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("Assignment2 Experiment")

#### Split data and vectorize words

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

sentences_train, sentences_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)
X_train


<1668x26282 sparse matrix of type '<class 'numpy.int64'>'
	with 339442 stored elements in Compressed Sparse Row format>
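
The printed shape and stored-element count show how sparse the matrix is. A small arithmetic sketch using only the numbers from the output above:

```python
# Numbers taken from the sparse-matrix summary printed above
n_docs, n_words = 1668, 26282
n_stored = 339442

density = n_stored / (n_docs * n_words)   # fraction of cells that are stored (non-zero)
avg_distinct_words = n_stored / n_docs    # stored entries per document

print(f"density: {density:.4%}")
print(f"average distinct words per article: {avg_distinct_words:.1f}")
```

Under 1% of the cells are stored, which is why the result is kept in Compressed Sparse Row format rather than as a dense array.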

## Testing Models:

### Logistic Regression

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

with mlflow.start_run(run_name="Basic LR Experiment") as run:

    log_reg_classifier = LogisticRegression(random_state = 10)
    log_reg_classifier.fit(X_train, y_train)
    log_reg_y_pred = log_reg_classifier.predict(X_test)

    log_reg_cm = confusion_matrix(y_test, log_reg_y_pred)
    print(log_reg_cm)
    log_reg_score = accuracy_score(y_test, log_reg_y_pred)
    
    # Log model
    mlflow.sklearn.log_model(log_reg_classifier, "LR-model")

    # Create metrics
    print(f"score: {log_reg_score}")

    # Log metrics
    mlflow.log_metric("score", log_reg_score)

    runID = run.info.run_uuid
    experimentID = run.info.experiment_id

    print(f"Inside MLflow Run with run_id `{runID}` and experiment_id `{experimentID}`")


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[[128   2   1   0   0]
 [  1 104   0   0   0]
 [  1   0  91   0   0]
 [  1   0   0 115   0]
 [  1   2   0   0 110]]
score: 0.9838420107719928
Inside MLflow Run with run_id `590f07e8d5414e5a88425aec7f3de393` and experiment_id `1`
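
The solver message above indicates that the iteration limit was reached before convergence. One common response, suggested by the message itself, is to raise `max_iter`. The sketch below shows the parameter on a made-up toy corpus. It was not rerun on the BBC data, so its effect on the 98.38% score is unknown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only
toy_texts = ["shares rise on profit", "bank profit falls",
             "team wins the match", "striker scores in match"]
toy_labels = [0, 0, 1, 1]  # 0 = business, 1 = sports

toy_X = CountVectorizer().fit_transform(toy_texts)

# max_iter raises the solver's iteration cap (value chosen arbitrarily here)
toy_clf = LogisticRegression(max_iter=1000, random_state=10)
toy_clf.fit(toy_X, toy_labels)

# n_iter_ reports how many iterations the solver actually used
print(toy_clf.n_iter_)
```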


### KNN

In [7]:
from sklearn.neighbors import KNeighborsClassifier


with mlflow.start_run(run_name="Basic KNN Experiment") as run:
    knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    knn_classifier.fit(X_train, y_train)
    knn_y_pred = knn_classifier.predict(X_test)

    knn_cm = confusion_matrix(y_test, knn_y_pred)
    print(knn_cm)
    knn_score = accuracy_score(y_test, knn_y_pred)

    # Log model
    mlflow.sklearn.log_model(knn_classifier, "KNN-model")

    # Create metrics
    print(f"score: {knn_score}")

    # Log metrics
    mlflow.log_metric("score", knn_score)

    runID = run.info.run_uuid
    experimentID = run.info.experiment_id

    print(f"Inside MLflow Run with run_id `{runID}` and experiment_id `{experimentID}`")

[[104   1   4  20   2]
 [  7  67   4  27   0]
 [ 12   5  68   5   2]
 [  7   1   2 106   0]
 [ 23   6   4  10  70]]
score: 0.7450628366247756
Inside MLflow Run with run_id `d7f022182d974c7f9c517b3a3bfa4865` and experiment_id `1`
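
KNN scores well below the other models here. One plausible contributing factor (an assumption, not something verified on this dataset) is that Euclidean distance on raw counts is sensitive to document length: a long and a short article on the same topic can end up far apart. A toy sketch with made-up two-word count vectors:

```python
import numpy as np

# Made-up counts over a two-word vocabulary: ["match", "profit"]
short_sports = np.array([2.0, 0.0])
long_sports = np.array([20.0, 0.0])   # same topic, ten times longer
short_business = np.array([0.0, 3.0])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Euclidean distance puts the short sports article closer to the business one
print(euclidean(short_sports, long_sports))      # distance to same-topic article
print(euclidean(short_sports, short_business))   # distance to other-topic article

# Cosine similarity ignores length and groups the two sports articles
print(cosine_similarity(short_sports, long_sports))
print(cosine_similarity(short_sports, short_business))
```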


### SVC (linear kernel)

In [8]:
from sklearn.svm import SVC

with mlflow.start_run(run_name="Basic SVR Experiment") as run:
    svr_classifier = SVC(kernel = 'linear', random_state = 0)
    svr_classifier.fit(X_train, y_train)
    svr_y_pred = svr_classifier.predict(X_test)

    svr_cm = confusion_matrix(y_test, svr_y_pred)
    print(svr_cm)
    svr_score = accuracy_score(y_test, svr_y_pred)

    # Log model
    mlflow.sklearn.log_model(svr_classifier, "SVR-model")

    # Create metrics
    print(f"score: {svr_score}")

    # Log metrics
    mlflow.log_metric("score", svr_score)

    runID = run.info.run_uuid
    experimentID = run.info.experiment_id

    print(f"Inside MLflow Run with run_id `{runID}` and experiment_id `{experimentID}`")

[[124   4   2   0   1]
 [  1 104   0   0   0]
 [  2   0  90   0   0]
 [  2   0   0 114   0]
 [  0   2   0   0 111]]
score: 0.9748653500897666
Inside MLflow Run with run_id `6aba42da8d574d44a9cdb28bda082775` and experiment_id `1`


### Kernel SVC (RBF)

In [9]:
from sklearn.svm import SVC

with mlflow.start_run(run_name="Basic KSVR Experiment") as run:
    ksvr_classifier = SVC(kernel = 'rbf', random_state = 0)
    ksvr_classifier.fit(X_train, y_train)
    ksvr_y_pred = ksvr_classifier.predict(X_test)

    ksvr_cm = confusion_matrix(y_test, ksvr_y_pred)
    print(ksvr_cm)
    ksvr_score = accuracy_score(y_test, ksvr_y_pred)

    # Log model
    mlflow.sklearn.log_model(ksvr_classifier, "KSVR-model")

    # Create metrics
    print(f"score: {ksvr_score}")

    # Log metrics
    mlflow.log_metric("score", ksvr_score)

    runID = run.info.run_uuid
    experimentID = run.info.experiment_id

    print(f"Inside MLflow Run with run_id `{runID}` and experiment_id `{experimentID}`")

[[125   2   3   0   1]
 [  4  97   0   4   0]
 [  3   0  88   0   1]
 [  1   0   0 114   1]
 [  6   0   0   0 107]]
score: 0.9533213644524237
Inside MLflow Run with run_id `3e98d6f366294fbd8cdbd59a6ed2039a` and experiment_id `1`
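
Each accuracy score above can be recovered from its confusion matrix: correct predictions lie on the diagonal, so accuracy is the diagonal sum divided by the total count. A short check using the four matrices printed above (the helper name is illustrative):

```python
def accuracy_from_confusion(cm):
    """Accuracy = correctly classified (diagonal) / all test samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Confusion matrices copied from the cell outputs above
matrices = {
    "Logistic Regression": [[128, 2, 1, 0, 0], [1, 104, 0, 0, 0], [1, 0, 91, 0, 0],
                            [1, 0, 0, 115, 0], [1, 2, 0, 0, 110]],
    "KNN": [[104, 1, 4, 20, 2], [7, 67, 4, 27, 0], [12, 5, 68, 5, 2],
            [7, 1, 2, 106, 0], [23, 6, 4, 10, 70]],
    "SVC (linear)": [[124, 4, 2, 0, 1], [1, 104, 0, 0, 0], [2, 0, 90, 0, 0],
                     [2, 0, 0, 114, 0], [0, 2, 0, 0, 111]],
    "SVC (rbf)": [[125, 2, 3, 0, 1], [4, 97, 0, 4, 0], [3, 0, 88, 0, 1],
                  [1, 0, 0, 114, 1], [6, 0, 0, 0, 107]],
}

for name, cm in matrices.items():
    print(f"{name}: {accuracy_from_confusion(cm):.4f}")
```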


## Testing ANN

In [17]:
from keras.models import Sequential
from keras import layers
input_dim = X_train.shape[1]  # Number of features
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
               optimizer='adam', 
               metrics=['accuracy'])
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 10)                262830    
_________________________________________________________________
dense_9 (Dense)              (None, 10)                110       
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 11        
Total params: 262,951
Trainable params: 262,951
Non-trainable params: 0
_________________________________________________________________


In [18]:
history = model.fit(X_train, y_train,
                     epochs=100,
                     verbose=False,
                     validation_data=(X_test, y_test),
                     batch_size=10)

In [19]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))


Training Accuracy: 0.1685
Testing Accuracy:  0.1885
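
The low accuracy is consistent with the output layer defined above. Softmax normalizes across the units of a layer, so a single-unit softmax always outputs 1.0 regardless of its input, and the network cannot distinguish the five classes. A minimal NumPy sketch of that behavior (the `softmax` helper is written here for illustration and is not the Keras implementation):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

# One output unit: every input maps to probability 1.0
single_unit = softmax(np.array([[-3.0], [0.0], [7.5]]))
print(single_unit.ravel())

# Five output units (one per category): a real distribution over classes
five_units = softmax(np.array([[2.0, 0.5, 0.1, -1.0, 0.3]]))
print(five_units.round(3), five_units.argmax(axis=-1))

# Share of test articles with encoded label 1 (second row total of the
# confusion matrices above is 105, out of 557 test samples)
label_one_share = 105 / 557
print(f"{label_one_share:.4f}")
```

The last figure matches the reported testing accuracy of 0.1885, which is what would be expected if a constant output of 1.0 were simply compared against the integer labels. For five integer-encoded classes, the usual Keras setup is a final `Dense(5, activation='softmax')` layer with `loss='sparse_categorical_crossentropy'`. That variant was not run here, so no accuracy is claimed for it.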
