# Support Vector Machines & Clustering

The task is to identify the author of a text and to find similarities between books by analysing each author's writing style, comparing several models under different hyperparameters. The dataset contains several texts by each author.

The models to be used and compared are: SVM, DBSCAN, KMeans, and Agglomerative.

## Reading the dataset

In [None]:
import numpy as np
import pandas as pd

dataset = pd.read_csv("Ciolacu C. Florentina-Neluta.csv")

## Encoding the authors

The authors are encoded as integers from 0 to 19.
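
The same wide-to-long reshaping can be sketched on a toy table. The column names and texts below are made up for illustration, and `melt` is used here only as one possible way to stack the author columns; it is not the notebook's own code.

```python
import pandas as pd

# Hypothetical wide table: an index column followed by one column per author.
wide = pd.DataFrame({
    "id": [0, 1],
    "Austen": ["text a1", "text a2"],
    "Dickens": ["text d1", "text d2"],
})

authors = list(wide.columns[1:])  # skip the leading index column
authors_dict = {author: index for index, author in enumerate(authors)}

# melt stacks the author columns into (author_name, text) rows in one call.
long = wide.melt(id_vars="id", value_vars=authors,
                 var_name="author_name", value_name="text")
long["author"] = long["author_name"].map(authors_dict)

print(long[["text", "author"]])
```

Each row of `long` now holds one text and the integer code of its author, which is the layout the rest of the notebook expects.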

In [None]:
# The first column is skipped; the remaining columns are treated as one author each.
authors = list(dataset.columns.values)
authors.pop(0)
authors_dict = {author: index for index, author in enumerate(authors)}

# Long format: one row per text, labelled with the author's index.
# DataFrame.append was removed in pandas 2.0, so the pieces are joined with pd.concat.
frames = []
for author, index in authors_dict.items():
    frame = pd.DataFrame({"text": dataset[author]})
    frame["author"] = index
    frames.append(frame)
new_dataset = pd.concat(frames)

## Splitting the dataset into features and labels

I chose TfidfVectorizer to extract features from the texts.
I also dropped words from languages other than English, approximated here by stripping accents and keeping only purely alphabetic ASCII tokens; English stop words are removed as well.
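
To see what these vectorizer settings actually keep, here is a small sketch on two made-up sentences (the sentences and the name `X_toy` are assumptions for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The café opened in 1999",
    "The crowd cheered at the café",
]

vectorizer = TfidfVectorizer(
    strip_accents="ascii",               # "café" becomes "cafe"
    stop_words="english",                # drops "the", "in", "at", ...
    token_pattern=r"(?u)\b[A-Za-z]+\b",  # keeps purely alphabetic tokens only
)
X_toy = vectorizer.fit_transform(docs)

# Inspect which tokens survived the filtering.
print(sorted(vectorizer.vocabulary_))
print(X_toy.shape)
```

The number `1999` never becomes a feature because the token pattern only matches runs of ASCII letters, and the accent in `café` is stripped before tokenization.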

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents='ascii', stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b')
X = vectorizer.fit_transform(new_dataset.text)
y = new_dataset.author
y = y.astype('int')

## Visualising the data

The data are plotted by author. The features were reduced to three dimensions with TruncatedSVD.
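
TruncatedSVD accepts sparse input directly and does not center the data, which makes it convenient for TF-IDF matrices. A minimal sketch on a random sparse matrix (the matrix and its size are arbitrary assumptions, not the notebook's data):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a TF-IDF matrix (50 documents, 200 terms).
rng = np.random.RandomState(0)
X_sparse = sparse_random(50, 200, density=0.05, format="csr", random_state=rng)

svd = TruncatedSVD(n_components=3, random_state=0)
X_3d = svd.fit_transform(X_sparse)  # dense array with one row per document

print(X_3d.shape)
print(svd.explained_variance_ratio_.sum())
```

The explained variance ratio gives a rough idea of how much structure a three-dimensional plot can show; with only three components it is usually a small fraction.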

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from mpl_toolkits.mplot3d import Axes3D

classifier = TruncatedSVD(n_components=3)
Xplot = classifier.fit_transform(X)
Xplot = pd.DataFrame(Xplot)
yplot = pd.DataFrame(y)

fig = plt.figure()
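# Axes3D(fig) no longer attaches the axes to the figure in recent Matplotlib versions.
ax = fig.add_subplot(projection="3d")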
ax.scatter(Xplot[0], Xplot[1], Xplot[2], c=yplot["author"], s=50,
           cmap="gnuplot",
           edgecolor="black", linewidth=0.5)

# Task 1 - Support Vector Machines

Text classification with Support Vector Machines, predicting the author of each text.

## Splitting the dataset into train and test

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y)

## Finding the best score

To find the best score, I varied the model over a set of parameters (kernel, C, gamma) and printed the scores obtained.
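
As an alternative to a hand-written loop, the same kind of search can be sketched with `GridSearchCV`, which also cross-validates each configuration. The synthetic data and the reduced grid below are assumptions for illustration; this is not the search the notebook runs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and author labels.
X_toy, y_toy = make_classification(
    n_samples=120, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# gamma only matters for the rbf kernel, so the linear kernel gets its own sub-grid.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1.0]},
    {"kernel": ["rbf"], "C": [0.1, 1.0], "gamma": [0.0001, 1.0]},
]

search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Because every configuration is scored on several folds, the comparison depends less on one particular train/test split.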

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.svm import SVC

best_svm_model = (None, None, -1.0, -1.0, -1.0)

# gamma is ignored by the linear kernel, so the two linear configurations are equivalent.
svm_params = [
    {"C": 1.0, "gamma": 0.0001, "kernel": "rbf"},
    {"C": 1.0, "gamma": 1.0, "kernel": "rbf"},
    {"C": 0.1, "gamma": 0.0001, "kernel": "rbf"},
    {"C": 0.5, "gamma": 1.0, "kernel": "rbf"},
    {"C": 1.0, "gamma": 0.0001, "kernel": "linear"},
    {"C": 1.0, "gamma": 1.0, "kernel": "linear"},
    {"C": 1.0, "gamma": 0.0001, "kernel": "poly"},
    {"C": 1.0, "gamma": 1.0, "kernel": "poly"},
]

for kwargs in svm_params:
    model = SVC(**kwargs)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)  # predict once and reuse for every metric
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="weighted")
    f1_score_ = f1_score(y_test, y_pred, average="weighted")
    # The best model is the one with the highest accuracy on the test split.
    if best_svm_model[2] < accuracy:
        best_svm_model = (model, kwargs, accuracy, precision, f1_score_)
    print(SVC.__name__, kwargs)
    print("Accuracy:", accuracy)
    print("Weighted precision:", precision)
    print("Weighted f1 score:", f1_score_)
    print("\n\n")

In [None]:
print("The best results were with the model: ", best_svm_model[1])
print("The scores obtained:")
print("Accuracy:", best_svm_model[2])
print("Precision:", best_svm_model[3])
print("f1 score:", best_svm_model[4])

## Dimensionality reduction

I tried reducing the dimensionality because the data were very high-dimensional. Using TruncatedSVD (a PCA-like decomposition that works on sparse matrices), I checked whether the results improved.
The conclusion is that dimensionality reduction helps, since the scores obtained were higher.
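
One way to compare dimensionalities on equal footing is to fix the train/test split and fit the reduction on the training part only. The sketch below does this with a pipeline on synthetic data; the data, the component counts, and the linear kernel are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and author labels.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=50, n_informative=10, n_classes=3, random_state=0
)
# One fixed, stratified split shared by every configuration.
x_train, x_test, y_train, y_test = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

scores = {}
for n_components in (20, 10, 5):
    # The pipeline fits TruncatedSVD on the training rows only.
    model = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=0),
        SVC(kernel="linear", C=1.0),
    )
    model.fit(x_train, y_train)
    scores[n_components] = model.score(x_test, y_test)

print(scores)
```

With the split held constant, differences between the scores reflect the number of components rather than which samples happened to land in the test set.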

In [None]:
pca_dimensions_list = [400, 200, 100]
best_svm_model = [(None, None, -1.0, -1.0, -1.0)] * len(pca_dimensions_list)

for j, pca_dimensions in enumerate(pca_dimensions_list):
    clf = TruncatedSVD(n_components=pca_dimensions)
    Xpca = clf.fit_transform(X)

    x_train, x_test, y_train, y_test = train_test_split(Xpca, y)
    
    print("TruncatedSVD components:", pca_dimensions)

    # gamma is ignored by the linear kernel, so the two linear configurations are equivalent.
    svm_params = [
        {"C": 1.0, "gamma": 0.0001, "kernel": "rbf"},
        {"C": 1.0, "gamma": 1.0, "kernel": "rbf"},
        {"C": 0.1, "gamma": 0.0001, "kernel": "rbf"},
        {"C": 0.5, "gamma": 1.0, "kernel": "rbf"},
        {"C": 1.0, "gamma": 0.0001, "kernel": "linear"},
        {"C": 1.0, "gamma": 1.0, "kernel": "linear"},
        {"C": 1.0, "gamma": 0.0001, "kernel": "poly"},
        {"C": 1.0, "gamma": 1.0, "kernel": "poly"},
    ]

    for kwargs in svm_params:
        model = SVC(**kwargs)
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)  # predict once and reuse for every metric
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average="weighted")
        f1_score_ = f1_score(y_test, y_pred, average="weighted")
        # Keep the most accurate model for this number of components.
        if best_svm_model[j][2] < accuracy:
            best_svm_model[j] = (model, kwargs, accuracy, precision, f1_score_)
        print(SVC.__name__, kwargs)
        print("Accuracy:", accuracy)
        print("Weighted precision:", precision)
        print("Weighted f1 score:", f1_score_)
        print("\n\n")

In [None]:
for i, dimensions in enumerate(pca_dimensions_list):
    print("The best results for TruncatedSVD with n_components:", dimensions)
    print(best_svm_model[i][1])
    print("Accuracy:", best_svm_model[i][2])
    print("Precision:", best_svm_model[i][3])
    print("f1 score:", best_svm_model[i][4])
    print("\n\n")

# Task 2 - Clustering

Text clustering with the following methods: DBSCAN, KMeans, and hierarchical (Agglomerative).

## DBSCAN - Finding the best score

To find the best score, I varied the model over a set of parameters (eps, min_samples) and printed the scores obtained. The best model is the one with the highest silhouette score.
Metrics used: silhouette score, homogeneity score, completeness score.
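
The three metrics measure different things, which a tiny hand-made example can show. The points and labels below are invented for illustration: two "authors", where the clustering splits the second author into two clusters.

```python
import numpy as np
from sklearn.metrics import completeness_score, homogeneity_score, silhouette_score

points = np.array([
    [0.0, 0.0], [0.1, 0.0],   # author 0
    [5.0, 5.0], [5.1, 5.0],   # author 1, first group
    [9.0, 0.0], [9.1, 0.0],   # author 1, second group
])
true_labels = [0, 0, 1, 1, 1, 1]
cluster_labels = [0, 0, 1, 1, 2, 2]

# Homogeneity: does every cluster contain texts of a single author?
homogeneity = homogeneity_score(true_labels, cluster_labels)
# Completeness: do all texts of an author end up in the same cluster?
completeness = completeness_score(true_labels, cluster_labels)
# Silhouette: geometric compactness and separation, computed without the true labels.
silhouette = silhouette_score(points, cluster_labels)

print(homogeneity, completeness, silhouette)
```

Here every cluster is pure, so homogeneity is perfect, while completeness is penalised because author 1 is spread over two clusters. The silhouette only looks at the geometry, so it can be high even when the clusters do not match the authors.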

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import homogeneity_score, silhouette_score, completeness_score

best_dbscan_model = (None, None, -1.0, -1.0, -1.0)

eps = np.arange(0.5, 1, 0.1)
min_samples = np.arange(2, 6, 1)

models = []

for eps_val in eps:
    for min_sample in min_samples:
        models.append((DBSCAN, {"eps": eps_val, "min_samples": min_sample}))

for i, (Model, kwargs) in enumerate(models):
    homogeneity = 0.0
    silhouette = 0.0
    completeness = 0.0
    model = Model(**kwargs)
    model.fit(X)
    unique, counts = np.unique(model.labels_, return_counts=True)
    if len(dict(zip(unique, counts))) > 1:
        homogeneity = homogeneity_score(y, model.labels_)
        silhouette = silhouette_score(X, model.labels_)
        completeness = completeness_score(y, model.labels_)
        if best_dbscan_model[3] < silhouette:
            best_dbscan_model = (model, kwargs, homogeneity, silhouette, completeness)
    print(Model.__name__, kwargs)
    print(dict(zip(unique, counts)))
    print("Homogeneity: ", homogeneity)
    print("Silhouette: ", silhouette)
    print("Completeness: ", completeness)
    print("\n\n")

In [None]:
print("The best results were with the DBSCAN model: ", best_dbscan_model[1])
print("The results obtained: ")
print("Homogeneity:", best_dbscan_model[2])
print("Silhouette:", best_dbscan_model[3])
print("Completeness:", best_dbscan_model[4])

### Dimensionality reduction

I tried reducing the dimensionality because the data were very high-dimensional. Using TruncatedSVD, I checked whether the results improved. The conclusion is that dimensionality reduction does not change the results much.
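
DBSCAN labels noise points as `-1`, and passing those labels straight to `silhouette_score` treats the noise as one more cluster. A possible variant, sketched on synthetic blobs with a few hand-placed outliers (all data and parameter values here are assumptions), scores only the points that were actually clustered:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs plus three isolated outliers.
X_toy, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [10, 0], [0, 10]], cluster_std=0.4, random_state=0
)
outliers = np.array([[30.0, 30.0], [-30.0, 30.0], [30.0, -30.0]])
X_toy = np.vstack([X_toy, outliers])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_toy)

clustered = labels != -1                      # mask of non-noise points
n_clusters = len(set(labels[clustered]))
n_noise = int((~clustered).sum())

# The silhouette needs at least two clusters among the remaining points.
score = None
if n_clusters > 1:
    score = silhouette_score(X_toy[clustered], labels[clustered])

print(n_clusters, n_noise, score)
```

Reporting the number of noise points next to the score matters, because a configuration that discards most of the data can otherwise look deceptively good.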

In [None]:
pca_dimensions_list = [400, 200, 100]
best_dbscan_model = [(None, None, -1.0, -1.0, -1.0)] * len(pca_dimensions_list)

for j, pca_dimensions in enumerate(pca_dimensions_list):
    clf = TruncatedSVD(n_components=pca_dimensions)
    Xpca = clf.fit_transform(X)
    
    eps = np.arange(0.5, 1, 0.1)
    min_samples = np.arange(2, 6, 1)
    
    models = []
    
    for eps_val in eps:
        for min_sample in min_samples:
            models.append((DBSCAN, {"eps": eps_val, "min_samples": min_sample}))
            
    print("Dimensions:", pca_dimensions)
    print("\n\n")
    
    for i, (Model, kwargs) in enumerate(models):
        homogeneity = 0.0
        silhouette = 0.0
        completeness = 0.0
        model = Model(**kwargs)
        model.fit(Xpca)
        unique, counts = np.unique(model.labels_, return_counts=True)
        if len(dict(zip(unique, counts))) > 1:
            homogeneity = homogeneity_score(y, model.labels_)
            silhouette = silhouette_score(Xpca, model.labels_)
            completeness = completeness_score(y, model.labels_)
            if best_dbscan_model[j][3] < silhouette:
                best_dbscan_model[j] = (model, kwargs, homogeneity, silhouette, completeness)
        print(Model.__name__, kwargs)
        print(dict(zip(unique, counts)))
        print("Homogeneity:", homogeneity)
        print("Silhouette:", silhouette)
        print("Completeness:", completeness)
        print("\n\n")

In [None]:
for i, dimensions in enumerate(pca_dimensions_list):
    print("The best results for TruncatedSVD with n_components:", dimensions)
    print(best_dbscan_model[i][1])
    print("Homogeneity:", best_dbscan_model[i][2])
    print("Silhouette:", best_dbscan_model[i][3])
    print("Completeness:", best_dbscan_model[i][4])
    print("\n\n")

## KMeans - Finding the best score

To find the best score, I varied the model over a set of parameters (n_clusters, init) and printed the scores obtained. The best model is the one with the highest silhouette score.
Metrics used: silhouette score, homogeneity score, completeness score.
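
KMeans starts from random centroids, so repeated runs can give different scores. A small sketch of the same sweep with `random_state` fixed, on synthetic blobs (the data, the range of `n_clusters`, and `n_init=10` are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_toy, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=0)

results = {}
for n_clusters in range(2, 7):
    for init in ("k-means++", "random"):
        # Fixed random_state makes each configuration reproducible across runs.
        model = KMeans(n_clusters=n_clusters, init=init, n_init=10, random_state=0)
        labels = model.fit_predict(X_toy)
        results[(n_clusters, init)] = silhouette_score(X_toy, labels)

# Configuration with the highest silhouette score.
best = max(results, key=results.get)
print(best, results[best])
```

Collecting the scores in a dictionary keeps the selection step separate from the fitting loop and makes it easy to inspect the runners-up as well.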

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, silhouette_score, completeness_score

best_kmeans_model = (None, None, -1.0, -1.0, -1.0)

n_clusters = np.arange(10, 20, 1)
inits = ['k-means++', 'random']

models = []

for n in n_clusters:
    for init in inits:
        models.append((KMeans, {"n_clusters": n, "init": init}))

for i, (Model, kwargs) in enumerate(models):
    print(Model.__name__, kwargs)
    homogeneity = 0.0
    silhouette = 0.0
    completeness = 0.0
    model = Model(**kwargs)
    model.fit(X)
    unique, counts = np.unique(model.labels_, return_counts=True)
    homogeneity = homogeneity_score(y, model.labels_)
    silhouette = silhouette_score(X, model.labels_)
    completeness = completeness_score(y, model.labels_)
    if best_kmeans_model[3] < silhouette:
        best_kmeans_model = (model, kwargs, homogeneity, silhouette, completeness)
    print("Homogeneity:", homogeneity)
    print("Silhouette:", silhouette)
    print("Completeness:", completeness)
    print("\n\n")

In [None]:
print("The best results were with the KMeans model: ", best_kmeans_model[1])
print("The results obtained: ")
print("Homogeneity:", best_kmeans_model[2])
print("Silhouette:", best_kmeans_model[3])
print("Completeness:", best_kmeans_model[4])

### Dimensionality reduction

I tried reducing the dimensionality because the data were very high-dimensional. Using TruncatedSVD, I checked whether the results improved. The conclusion is that dimensionality reduction changes the results, but for the worse.

In [None]:
pca_dimensions_list = [400, 10, 5]
best_kmeans_model = [(None, None, -1.0, -1.0, -1.0)] * len(pca_dimensions_list)

for j, pca_dimensions in enumerate(pca_dimensions_list):
    clf = TruncatedSVD(n_components=pca_dimensions)
    Xpca = clf.fit_transform(X)
    
    n_clusters = np.arange(15, 21, 1)
    inits = ['k-means++', 'random']
    
    models = []
    
    for n in n_clusters:
        for init in inits:
            models.append((KMeans, {"n_clusters": n, "init": init}))

    print("Dimensions:", pca_dimensions)
    print("\n\n")
    
    for i, (Model, kwargs) in enumerate(models):
        print(Model.__name__, kwargs)
        homogeneity = 0.0
        silhouette = 0.0
        completeness = 0.0
        model = Model(**kwargs)
        model.fit(Xpca)
        unique, counts = np.unique(model.labels_, return_counts=True)
        homogeneity = homogeneity_score(y, model.labels_)
        silhouette = silhouette_score(Xpca, model.labels_)
        completeness = completeness_score(y, model.labels_)
        # get best model to print after
        if best_kmeans_model[j][3] < silhouette:
            best_kmeans_model[j] = (model, kwargs, homogeneity, silhouette, completeness)
        print("Homogeneity:", homogeneity)
        print("Silhouette:", silhouette)
        print("Completeness:", completeness)
        print("\n\n")

In [None]:
for i, dimensions in enumerate(pca_dimensions_list):
    print("The best results for TruncatedSVD with n_components:", dimensions)
    print(best_kmeans_model[i][1])
    print("Homogeneity:", best_kmeans_model[i][2])
    print("Silhouette:", best_kmeans_model[i][3])
    print("Completeness:", best_kmeans_model[i][4])
    print("\n\n")

### Plotting the best KMeans models

In [None]:
pca_classifier = TruncatedSVD(n_components=3)
Xpca_plot = pca_classifier.fit_transform(X)
Xpca_plot = pd.DataFrame(Xpca_plot)

model = KMeans(n_clusters=14, init='k-means++')
model.fit(X)
y_plot = pd.DataFrame(model.labels_)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(Xpca_plot[0], Xpca_plot[1], Xpca_plot[2], c=y_plot[0], s=50,
           cmap="gnuplot",
           edgecolor="black", linewidth=0.5)

In [None]:
for i, (Model, kwargs, dimensions) in enumerate([(KMeans, {'n_clusters': 19, 'init': 'k-means++'}, 400), (KMeans, {'n_clusters': 19, 'init': 'k-means++'}, 10), (KMeans, {'n_clusters': 17, 'init': 'k-means++'}, 5)]):
    print(Model.__name__, kwargs, "with: ", dimensions)
    clf = TruncatedSVD(n_components=dimensions)
    Xpca = clf.fit_transform(X)
    model = Model(**kwargs)
    model.fit(Xpca)
    y_plot = pd.DataFrame(model.labels_)
    fig = plt.figure(i)
    ax = fig.add_subplot(projection="3d")
    ax.scatter(Xpca_plot[0], Xpca_plot[1], Xpca_plot[2], c=y_plot[0], s=50,
           cmap="gnuplot",
           edgecolor="black", linewidth=0.5)

## Agglomerative - Finding the best score

To find the best score, I varied the model over the n_clusters parameter and printed the scores obtained. The best model is the one with the highest silhouette score.
Metrics used: silhouette score, homogeneity score, completeness score.
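
When only `n_clusters` is set, AgglomerativeClustering uses Ward linkage by default, and the notebook feeds it a dense array obtained with `toarray()`. A minimal sketch on six hand-placed points (the points are invented for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import AgglomerativeClustering

dense = np.array([
    [0.0, 0.0], [0.0, 0.2],
    [5.0, 5.0], [5.0, 5.2],
    [10.0, 0.0], [10.0, 0.2],
])
X_sparse = csr_matrix(dense)  # stands in for a sparse TF-IDF matrix

model = AgglomerativeClustering(n_clusters=3)    # default linkage is "ward"
labels = model.fit_predict(X_sparse.toarray())   # converted to dense before fitting

print(labels)
```

The nearest pairs are merged first, so each pair of neighbouring points ends up in its own cluster. On a real TF-IDF matrix the dense conversion can be memory-hungry, which is one more reason to reduce the dimensionality first.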

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import homogeneity_score, silhouette_score, completeness_score

best_agglomerative_model = (None, None, -1.0, -1.0, -1.0)

n_clusters = np.arange(2, 11, 1)

models = []

for n in n_clusters:
    models.append((AgglomerativeClustering, {"n_clusters": n}))

for i, (Model, kwargs) in enumerate(models):
    print(Model.__name__, kwargs)
    homogeneity = 0.0
    silhouette = 0.0
    completeness = 0.0
    model = Model(**kwargs)
    model.fit(X.toarray())
    unique, counts = np.unique(model.labels_, return_counts=True)
    homogeneity = homogeneity_score(y, model.labels_)
    silhouette = silhouette_score(X, model.labels_)
    completeness = completeness_score(y, model.labels_)
    if best_agglomerative_model[3] < silhouette:
        best_agglomerative_model = (model, kwargs, homogeneity, silhouette, completeness)
    print("Homogeneity:", homogeneity)
    print("Silhouette:", silhouette)
    print("Completeness:", completeness)
    print("\n\n")

In [None]:
print("The best results were with the Agglomerative model: ", best_agglomerative_model[1])
print("The results obtained: ")
print("Homogeneity:", best_agglomerative_model[2])
print("Silhouette:", best_agglomerative_model[3])
print("Completeness:", best_agglomerative_model[4])

### Dimensionality reduction

I tried reducing the dimensionality because the data were very high-dimensional. Using TruncatedSVD, I checked whether the results improved. The conclusion is that dimensionality reduction does not change the results much.

In [None]:
pca_dimensions_list = [400, 10, 5]
best_agglomerative_model = [(None, None, -1.0, -1.0, -1.0)] * len(pca_dimensions_list)

for j, pca_dimensions in enumerate(pca_dimensions_list):
    clf = TruncatedSVD(n_components=pca_dimensions)
    Xpca = clf.fit_transform(X)
    
    n_clusters = np.arange(2, 11, 1)
    
    models = []
    
    for n in n_clusters:
        models.append((AgglomerativeClustering, {"n_clusters": n}))

    print("Dimensions:", pca_dimensions)
    print("\n\n")
    
    for i, (Model, kwargs) in enumerate(models):
        print(Model.__name__, kwargs)
        homogeneity = 0.0
        silhouette = 0.0
        completeness = 0.0
        model = Model(**kwargs)
        model.fit(Xpca)
        unique, counts = np.unique(model.labels_, return_counts=True)
        homogeneity = homogeneity_score(y, model.labels_)
        silhouette = silhouette_score(Xpca, model.labels_)
        completeness = completeness_score(y, model.labels_)
        if best_agglomerative_model[j][3] < silhouette:
            best_agglomerative_model[j] = (model, kwargs, homogeneity, silhouette, completeness)
        print("Homogeneity:", homogeneity)
        print("Silhouette:", silhouette)
        print("Completeness:", completeness)
        print("\n\n")

In [None]:
for i, dimensions in enumerate(pca_dimensions_list):
    print("The best results for TruncatedSVD with n_components:", dimensions)
    print(best_agglomerative_model[i][1])
    print("Homogeneity:", best_agglomerative_model[i][2])
    print("Silhouette:", best_agglomerative_model[i][3])
    print("Completeness:", best_agglomerative_model[i][4])
    print("\n\n")

### Plotting the best Agglomerative models

In [None]:
model = AgglomerativeClustering(n_clusters=10)
model.fit(X.toarray())
y_plot = pd.DataFrame(model.labels_)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(Xpca_plot[0], Xpca_plot[1], Xpca_plot[2], c=y_plot[0], s=50,
           cmap="gnuplot",
           edgecolor="black", linewidth=0.5)

In [None]:
for i, (Model, kwargs, dimensions) in enumerate([(AgglomerativeClustering, {'n_clusters': 10}, 400), (AgglomerativeClustering, {'n_clusters': 5}, 10), (AgglomerativeClustering, {'n_clusters': 5}, 5)]):
    print(Model.__name__, kwargs, "with: ", dimensions)
    clf = TruncatedSVD(n_components=dimensions)
    Xpca = clf.fit_transform(X)
    model = Model(**kwargs)
    model.fit(Xpca)
    y_plot = pd.DataFrame(model.labels_)
    fig = plt.figure(i)
    ax = fig.add_subplot(projection="3d")
    ax.scatter(Xpca_plot[0], Xpca_plot[1], Xpca_plot[2], c=y_plot[0], s=50,
           cmap="gnuplot",
           edgecolor="white", linewidth=1)