# CA-2
GitHub repo link: https://github.com/faisal3325/ai2-ca2

## Q1 Running supervised ML for predicting topics
We use the topic labels produced by NMF as the classification target because NMF separated the topics more accurately than LDA. Based on our observations of the graphs, the probabilities of words in a few topics were quite low.

In [80]:
# importing libraries
import pandas as pd # for dataframe
import time # for timer used to calculate execution time
import pickle # for exporting model

In [81]:
# Helper functions

# Timer functions for calculating execution time of models
def time_elapsed_start():
    # Starting the timer
    return time.time()

def time_elapsed_stop(start):
    # Stopping the timer and reporting the elapsed time in
    # seconds, minutes, and hours
    seconds = time.time() - start
    minutes = seconds / 60  # named 'minutes' to avoid shadowing the built-in min()
    hours = minutes / 60
    print(f"Execution took {seconds} seconds - ({minutes} minutes), ({hours} hours)")

def save_model(model, model_name):
    # Saving the model to disk; the with-block closes the file handle
    filename = f'{model_name}.sav'
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
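
The `save_model` helper only writes a model to disk. Reading it back is the mirror operation. The sketch below assumes a hypothetical `load_model` helper (it is not part of the notebook) and uses a plain dictionary as a stand-in for a fitted model.

```python
import os
import pickle
import tempfile


def save_model(model, model_name):
    # Write the object to '<model_name>.sav', as the notebook helper does
    with open(f"{model_name}.sav", "wb") as f:
        pickle.dump(model, f)


def load_model(model_name):
    # Hypothetical counterpart: read the object back from '<model_name>.sav'
    with open(f"{model_name}.sav", "rb") as f:
        return pickle.load(f)


# A dictionary stands in for a fitted model in this sketch
stand_in = {"name": "demo", "weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model")
    save_model(stand_in, path)
    restored = load_model(path)

print(restored == stand_in)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.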

In [7]:
# loading data
df = pd.read_csv('quora_supervised.csv')

In [8]:
df.sample(5)

|  | question | topic_num | topic |
|---|---|---|---|
| 34833 | What are the pros and cons of choosing .com ve... | 3 | Finances/Earning Money |
| 54082 | What is the best node.js ide available for free? | 0 | Opinions |
| 145534 | Where can I get audiobooks for free in English? | 6 | Learning, Programming, and Education |
| 32674 | How can you find a Korean pen pal? | 6 | Learning, Programming, and Education |
| 47009 | What is funniest video you have ever watched? | 0 | Opinions |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 3 columns):
question     200000 non-null object
topic_num    200000 non-null int64
topic        200000 non-null object
dtypes: int64(1), object(2)
memory usage: 4.6+ MB


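We convert the questions to numeric form with TfidfVectorizer, using max_df=0.9 (terms that appear in more than 90% of the questions are ignored), min_df=5 (terms that appear in fewer than 5 questions are ignored), and English stop-word removal.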

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.9, min_df=5, stop_words="english")
X_tfidf = tfidf.fit_transform(df['question'])
X_tfidf

<200000x15154 sparse matrix of type '<class 'numpy.float64'>'
	with 948262 stored elements in Compressed Sparse Row format>
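
To make the `max_df` and `min_df` thresholds concrete, here is a minimal pure-Python sketch of document-frequency pruning on a toy corpus. It is not scikit-learn's implementation: the function name `prune_vocabulary`, the whitespace tokenizer, and the four example questions are assumptions made for illustration, and stop-word removal is left out.

```python
from collections import Counter


def prune_vocabulary(docs, min_df, max_df):
    # Count in how many documents each term appears (document frequency)
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    # Keep terms that appear in at least min_df documents (absolute count)
    # and in at most a max_df share of the documents (proportion)
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )


docs = [
    "how to learn python",
    "how to learn java",
    "how to earn money",
    "how to cook rice",
]

# 'how' and 'to' appear in every document and exceed max_df;
# every other term except 'learn' appears only once and falls below min_df
print(prune_vocabulary(docs, min_df=2, max_df=0.9))  # ['learn']
```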

We split the data into a training set (80%) and a test set (20%, i.e. 40,000 of the 200,000 questions).

In [44]:
from sklearn.model_selection import train_test_split
y = df['topic_num']
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=0)
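
The `random_state=0` argument makes the shuffle reproducible, so the same rows land in the test set on every run. The sketch below illustrates that idea with the standard library only. `simple_split` is a hypothetical helper, not scikit-learn's algorithm, and it will not reproduce the exact rows chosen by `train_test_split`.

```python
import random


def simple_split(items, test_size, seed):
    # Shuffle a copy with a seeded generator, then cut off the test portion
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = simple_split(rows, test_size=0.2, seed=0)
train_b, test_b = simple_split(rows, test_size=0.2, seed=0)

print(len(train_a), len(test_a))  # 8 2
print(test_a == test_b)           # True: same seed, same split
```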

## Logistic Regression

In [45]:
from sklearn.linear_model import LogisticRegression

lin_reg_model = LogisticRegression(solver='lbfgs')
lin_reg_model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [46]:
from sklearn import metrics

lin_reg_model_predictions = lin_reg_model.predict(X_test)

In [47]:
lin_reg_model_predictions

array([3, 3, 6, ..., 9, 1, 9], dtype=int64)

### Confusion Matrix

In [48]:
print(metrics.confusion_matrix(y_test,lin_reg_model_predictions))

[[4340   32   15    7    5   62   17    7  109  266]
 [  29 4248   11    5   10   90   12   10   84  335]
 [  29   52 1610   10    8   36   12    7   90  176]
 [  36   34   10 1858    2   80   10    7   64  129]
 [  16   43    3    6 1708   38    8    3   43  127]
 [  53   65    7   10   19 4091   13   16  143  332]
 [  32   34    4    5    9   48 2041   15   91  174]
 [  15   43    6    2    6   79    5 2283   61  186]
 [  41   93   11   21   15  116   41    5 4828  281]
 [  46   78   58   18   23  124   23   29  141 8171]]


In [50]:
dataframe_labels = pd.DataFrame(metrics.confusion_matrix(y_test,lin_reg_model_predictions), 
                  index=['correct 1', 'correct 2', 'correct 3', 'correct 4', 'correct 5', 'correct 6', 'correct 7', 'correct 8', 'correct 9', 'correct 10'], 
                  columns=['predicted 1', 'predicted 2', 'predicted 3', 'predicted 4', 'predicted 5', 'predicted 6', 'predicted 7', 'predicted 8', 'predicted 9', 'predicted 10'])
dataframe_labels

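|  | predicted 1 | predicted 2 | predicted 3 | predicted 4 | predicted 5 | predicted 6 | predicted 7 | predicted 8 | predicted 9 | predicted 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| correct 1 | 4340 | 32 | 15 | 7 | 5 | 62 | 17 | 7 | 109 | 266 |
| correct 2 | 29 | 4248 | 11 | 5 | 10 | 90 | 12 | 10 | 84 | 335 |
| correct 3 | 29 | 52 | 1610 | 10 | 8 | 36 | 12 | 7 | 90 | 176 |
| correct 4 | 36 | 34 | 10 | 1858 | 2 | 80 | 10 | 7 | 64 | 129 |
| correct 5 | 16 | 43 | 3 | 6 | 1708 | 38 | 8 | 3 | 43 | 127 |
| correct 6 | 53 | 65 | 7 | 10 | 19 | 4091 | 13 | 16 | 143 | 332 |
| correct 7 | 32 | 34 | 4 | 5 | 9 | 48 | 2041 | 15 | 91 | 174 |
| correct 8 | 15 | 43 | 6 | 2 | 6 | 79 | 5 | 2283 | 61 | 186 |
| correct 9 | 41 | 93 | 11 | 21 | 15 | 116 | 41 | 5 | 4828 | 281 |
| correct 10 | 46 | 78 | 58 | 18 | 23 | 124 | 23 | 29 | 141 | 8171 |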


In [51]:
print(metrics.classification_report(y_test,lin_reg_model_predictions))

              precision    recall  f1-score   support

           0       0.94      0.89      0.91      4860
           1       0.90      0.88      0.89      4834
           2       0.93      0.79      0.86      2030
           3       0.96      0.83      0.89      2230
           4       0.95      0.86      0.90      1995
           5       0.86      0.86      0.86      4749
           6       0.94      0.83      0.88      2453
           7       0.96      0.85      0.90      2686
           8       0.85      0.89      0.87      5452
           9       0.80      0.94      0.87      8711

   micro avg       0.88      0.88      0.88     40000
   macro avg       0.91      0.86      0.88     40000
weighted avg       0.88      0.88      0.88     40000



In [52]:
print(metrics.accuracy_score(y_test,lin_reg_model_predictions))

0.87945


Logistic Regression reached an accuracy of about 88% (0.87945).
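
The accuracy and the per-class scores can be recomputed directly from the Logistic Regression confusion matrix printed above, which shows how the numbers in the classification report relate to it. This is a plain-Python sketch; the variable names are illustrative.

```python
# Logistic Regression confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = [
    [4340, 32, 15, 7, 5, 62, 17, 7, 109, 266],
    [29, 4248, 11, 5, 10, 90, 12, 10, 84, 335],
    [29, 52, 1610, 10, 8, 36, 12, 7, 90, 176],
    [36, 34, 10, 1858, 2, 80, 10, 7, 64, 129],
    [16, 43, 3, 6, 1708, 38, 8, 3, 43, 127],
    [53, 65, 7, 10, 19, 4091, 13, 16, 143, 332],
    [32, 34, 4, 5, 9, 48, 2041, 15, 91, 174],
    [15, 43, 6, 2, 6, 79, 5, 2283, 61, 186],
    [41, 93, 11, 21, 15, 116, 41, 5, 4828, 281],
    [46, 78, 58, 18, 23, 124, 23, 29, 141, 8171],
]
n_classes = len(cm)

# Accuracy: correct predictions (the diagonal) divided by all predictions
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: diagonal entry divided by its row total (true members of the class)
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
# Precision: diagonal entry divided by its column total (predicted members)
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)]

print(accuracy)                # 0.87945
print(round(recall[0], 2))     # 0.89
print(round(precision[0], 2))  # 0.94
```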

## Naive Bayes

In [53]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()

nb_model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [83]:
nb_model_predictions = nb_model.predict(X_test)

### Confusion Matrix

In [54]:
print(metrics.confusion_matrix(y_test,nb_model_predictions))

[[3608   61   41   31   18  159   96   10  328  508]
 [  84 3347   43   33   11  228   38   53  260  737]
 [ 102   93 1226   19    5   40   31   21  130  363]
 [ 224   82    8 1109    9  222   31   16  170  359]
 [  56  104    7   10 1091   89   13   10  110  505]
 [ 194  160    5   24    8 3428   27   46  320  537]
 [ 150   81    4    5    5   84 1625    6  256  237]
 [ 114  115   16   10    9  206   22 1532  200  462]
 [ 265  209   12   24   12  161   65   24 4021  659]
 [ 197  243   44   36   23  273   48   59  266 7522]]


In [71]:
dataframe_labels = pd.DataFrame(metrics.confusion_matrix(y_test,nb_model_predictions), 
                  index=['correct 1', 'correct 2', 'correct 3', 'correct 4', 'correct 5', 'correct 6', 'correct 7', 'correct 8', 'correct 9', 'correct 10'], 
                  columns=['predicted 1', 'predicted 2', 'predicted 3', 'predicted 4', 'predicted 5', 'predicted 6', 'predicted 7', 'predicted 8', 'predicted 9', 'predicted 10'])
dataframe_labels

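|  | predicted 1 | predicted 2 | predicted 3 | predicted 4 | predicted 5 | predicted 6 | predicted 7 | predicted 8 | predicted 9 | predicted 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| correct 1 | 3608 | 61 | 41 | 31 | 18 | 159 | 96 | 10 | 328 | 508 |
| correct 2 | 84 | 3347 | 43 | 33 | 11 | 228 | 38 | 53 | 260 | 737 |
| correct 3 | 102 | 93 | 1226 | 19 | 5 | 40 | 31 | 21 | 130 | 363 |
| correct 4 | 224 | 82 | 8 | 1109 | 9 | 222 | 31 | 16 | 170 | 359 |
| correct 5 | 56 | 104 | 7 | 10 | 1091 | 89 | 13 | 10 | 110 | 505 |
| correct 6 | 194 | 160 | 5 | 24 | 8 | 3428 | 27 | 46 | 320 | 537 |
| correct 7 | 150 | 81 | 4 | 5 | 5 | 84 | 1625 | 6 | 256 | 237 |
| correct 8 | 114 | 115 | 16 | 10 | 9 | 206 | 22 | 1532 | 200 | 462 |
| correct 9 | 265 | 209 | 12 | 24 | 12 | 161 | 65 | 24 | 4021 | 659 |
| correct 10 | 197 | 243 | 44 | 36 | 23 | 273 | 48 | 59 | 266 | 7522 |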


In [56]:
print(metrics.classification_report(y_test,nb_model_predictions))

              precision    recall  f1-score   support

           0       0.72      0.74      0.73      4860
           1       0.74      0.69      0.72      4834
           2       0.87      0.60      0.71      2030
           3       0.85      0.50      0.63      2230
           4       0.92      0.55      0.68      1995
           5       0.70      0.72      0.71      4749
           6       0.81      0.66      0.73      2453
           7       0.86      0.57      0.69      2686
           8       0.66      0.74      0.70      5452
           9       0.63      0.86      0.73      8711

   micro avg       0.71      0.71      0.71     40000
   macro avg       0.78      0.66      0.70     40000
weighted avg       0.73      0.71      0.71     40000



In [57]:
print(metrics.accuracy_score(y_test,nb_model_predictions))

0.712725


Naive Bayes reached an accuracy of about 71% (0.712725).
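
In the Naive Bayes classification report, the macro-averaged recall (0.66) is lower than the weighted average (0.71). The macro average treats every class equally, so the weak recall on smaller classes such as 3, 4, and 7 pulls it down, while the weighted average is dominated by the larger classes. The sketch below recomputes both from the confusion matrix printed above; the variable names are illustrative.

```python
# Naive Bayes confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = [
    [3608, 61, 41, 31, 18, 159, 96, 10, 328, 508],
    [84, 3347, 43, 33, 11, 228, 38, 53, 260, 737],
    [102, 93, 1226, 19, 5, 40, 31, 21, 130, 363],
    [224, 82, 8, 1109, 9, 222, 31, 16, 170, 359],
    [56, 104, 7, 10, 1091, 89, 13, 10, 110, 505],
    [194, 160, 5, 24, 8, 3428, 27, 46, 320, 537],
    [150, 81, 4, 5, 5, 84, 1625, 6, 256, 237],
    [114, 115, 16, 10, 9, 206, 22, 1532, 200, 462],
    [265, 209, 12, 24, 12, 161, 65, 24, 4021, 659],
    [197, 243, 44, 36, 23, 273, 48, 59, 266, 7522],
]

# Support is the number of true samples per class (the row totals)
support = [sum(row) for row in cm]
recall = [cm[i][i] / support[i] for i in range(len(cm))]

# Macro average: every class counts equally
macro_recall = sum(recall) / len(recall)
# Weighted average: every class counts in proportion to its support
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(round(macro_recall, 2))     # 0.66
print(round(weighted_recall, 6))  # 0.712725, the same value as the accuracy
```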

## Random Forest Classifier

In [67]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()

start = time_elapsed_start()
rf_model.fit(X_train, y_train)
time_elapsed_stop(start)

Execution took 70.71564793586731 seconds - (1.178594132264455 minutes), (0.01964323553774092 hours)


In [84]:
rf_model_predictions = rf_model.predict(X_test)

In [68]:
print(metrics.confusion_matrix(y_test,rf_model_predictions))

[[4614   31   13   14    5   35   24    7   66   51]
 [ 134 4334   29   22   24   58   14   15   67  137]
 [  43   29 1812   15    6   23    7    1   35   59]
 [  45   26   12 2054    3   26    4   11   17   32]
 [  26   23    2    5 1868   11    3    7   20   30]
 [ 135   78   15   33   38 4129   17   26  107  171]
 [  52   25   17   11    9   28 2179   11   55   66]
 [  37   31    8    4    8   46   10 2462   25   55]
 [ 159   81   25   41   39  112   46   20 4802  127]
 [ 157  118  121   41   48  146   42   49  174 7815]]


### Confusion Matrix

In [72]:
dataframe_labels = pd.DataFrame(metrics.confusion_matrix(y_test,rf_model_predictions), 
                  index=['correct 1', 'correct 2', 'correct 3', 'correct 4', 'correct 5', 'correct 6', 'correct 7', 'correct 8', 'correct 9', 'correct 10'], 
                  columns=['predicted 1', 'predicted 2', 'predicted 3', 'predicted 4', 'predicted 5', 'predicted 6', 'predicted 7', 'predicted 8', 'predicted 9', 'predicted 10'])
dataframe_labels

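|  | predicted 1 | predicted 2 | predicted 3 | predicted 4 | predicted 5 | predicted 6 | predicted 7 | predicted 8 | predicted 9 | predicted 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| correct 1 | 4614 | 31 | 13 | 14 | 5 | 35 | 24 | 7 | 66 | 51 |
| correct 2 | 134 | 4334 | 29 | 22 | 24 | 58 | 14 | 15 | 67 | 137 |
| correct 3 | 43 | 29 | 1812 | 15 | 6 | 23 | 7 | 1 | 35 | 59 |
| correct 4 | 45 | 26 | 12 | 2054 | 3 | 26 | 4 | 11 | 17 | 32 |
| correct 5 | 26 | 23 | 2 | 5 | 1868 | 11 | 3 | 7 | 20 | 30 |
| correct 6 | 135 | 78 | 15 | 33 | 38 | 4129 | 17 | 26 | 107 | 171 |
| correct 7 | 52 | 25 | 17 | 11 | 9 | 28 | 2179 | 11 | 55 | 66 |
| correct 8 | 37 | 31 | 8 | 4 | 8 | 46 | 10 | 2462 | 25 | 55 |
| correct 9 | 159 | 81 | 25 | 41 | 39 | 112 | 46 | 20 | 4802 | 127 |
| correct 10 | 157 | 118 | 121 | 41 | 48 | 146 | 42 | 49 | 174 | 7815 |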


In [70]:
print(metrics.accuracy_score(y_test,rf_model_predictions))

0.901725


Random Forest reached an accuracy of about 90% (0.901725).

## Support Vector Classifier

In [73]:
from sklearn.svm import SVC

svc_model = SVC(gamma="auto")

start = time_elapsed_start()
svc_model.fit(X_train, y_train)
time_elapsed_stop(start)

svc_model_predictions = svc_model.predict(X_test)

print(metrics.confusion_matrix(y_test, svc_model_predictions))

Execution took 5735.429756402969 seconds - (95.5904959400495 minutes), (1.5931749323341582 hours)
[[   0    0    0    0    0    0    0    0    0 4860]
 [   0    0    0    0    0    0    0    0    0 4834]
 [   0    0    0    0    0    0    0    0    0 2030]
 [   0    0    0    0    0    0    0    0    0 2230]
 [   0    0    0    0    0    0    0    0    0 1995]
 [   0    0    0    0    0    0    0    0    0 4749]
 [   0    0    0    0    0    0    0    0    0 2453]
 [   0    0    0    0    0    0    0    0    0 2686]
 [   0    0    0    0    0    0    0    0    0 5452]
 [   0    0    0    0    0    0    0    0    0 8711]]


In [85]:
svc_model_predictions = svc_model.predict(X_test)

### Confusion Matrix

In [86]:
print(metrics.confusion_matrix(y_test, svc_model_predictions))

[[   0    0    0    0    0    0    0    0    0 4860]
 [   0    0    0    0    0    0    0    0    0 4834]
 [   0    0    0    0    0    0    0    0    0 2030]
 [   0    0    0    0    0    0    0    0    0 2230]
 [   0    0    0    0    0    0    0    0    0 1995]
 [   0    0    0    0    0    0    0    0    0 4749]
 [   0    0    0    0    0    0    0    0    0 2453]
 [   0    0    0    0    0    0    0    0    0 2686]
 [   0    0    0    0    0    0    0    0    0 5452]
 [   0    0    0    0    0    0    0    0    0 8711]]


In [82]:
save_model(svc_model, 'svc_model')

In [74]:
# You can make the confusion matrix less confusing by adding labels:
dataframe_labels = pd.DataFrame(metrics.confusion_matrix(y_test,svc_model_predictions), 
                  index=['correct 1', 'correct 2', 'correct 3', 'correct 4', 'correct 5', 'correct 6', 'correct 7', 'correct 8', 'correct 9', 'correct 10'], 
                  columns=['predicted 1', 'predicted 2', 'predicted 3', 'predicted 4', 'predicted 5', 'predicted 6', 'predicted 7', 'predicted 8', 'predicted 9', 'predicted 10'])
dataframe_labels

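|  | predicted 1 | predicted 2 | predicted 3 | predicted 4 | predicted 5 | predicted 6 | predicted 7 | predicted 8 | predicted 9 | predicted 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| correct 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4860 |
| correct 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4834 |
| correct 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2030 |
| correct 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2230 |
| correct 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1995 |
| correct 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4749 |
| correct 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2453 |
| correct 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2686 |
| correct 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5452 |
| correct 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8711 |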


In [75]:
print(metrics.classification_report(y_test,svc_model_predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      4860
           1       0.00      0.00      0.00      4834
           2       0.00      0.00      0.00      2030
           3       0.00      0.00      0.00      2230
           4       0.00      0.00      0.00      1995
           5       0.00      0.00      0.00      4749
           6       0.00      0.00      0.00      2453
           7       0.00      0.00      0.00      2686
           8       0.00      0.00      0.00      5452
           9       0.22      1.00      0.36      8711

   micro avg       0.22      0.22      0.22     40000
   macro avg       0.02      0.10      0.04     40000
weighted avg       0.05      0.22      0.08     40000



In [77]:
print(metrics.accuracy_score(y_test,svc_model_predictions))

0.217775


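SVC reached an accuracy of about 22% (0.217775). The confusion matrix shows that it predicted class 9 for every test question, so this accuracy is simply the share of class 9 in the test set (8711 of 40000).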
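
One plausible explanation for the single-class predictions, not verified by retraining here, is the kernel width. `SVC` uses an RBF kernel by default, and `gamma="auto"` sets gamma to 1 / n_features, which is 1/15154 for this TF-IDF matrix. `TfidfVectorizer` L2-normalizes each row by default, so the squared distance between any two questions is at most 2. The sketch below computes the resulting range of kernel values; the variable names are illustrative.

```python
import math

n_features = 15154      # number of TF-IDF columns reported above
gamma = 1 / n_features  # what gamma="auto" resolves to

# Rows are non-negative and have an L2 norm of at most 1, so the squared
# Euclidean distance between two rows lies between 0 and 2.
max_sq_dist = 2.0

# RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2)
k_identical = math.exp(-gamma * 0.0)         # two identical rows
k_farthest = math.exp(-gamma * max_sq_dist)  # the two most distant rows possible

print(k_identical)           # 1.0
print(round(k_farthest, 5))  # 0.99987
```

With such a small gamma, every pair of questions looks almost equally similar to the model, which is consistent with it falling back to the largest class. A linear kernel or a larger gamma might behave differently, but that was not tested in this notebook.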

Random Forest achieved the highest accuracy of the four models, at about 90%.
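
The comparison can be made explicit by collecting the four accuracy scores in one place. This is a small sketch using the values printed in the sections above; the dictionary and variable names are illustrative.

```python
# Accuracy scores printed in the sections above
accuracy = {
    "Logistic Regression": 0.87945,
    "Naive Bayes": 0.712725,
    "Random Forest": 0.901725,
    "SVC": 0.217775,
}

# Sort from best to worst and print a small ranking
ranking = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<20} {score:.1%}")

best_model = ranking[0][0]
```

Accuracy is not the only consideration: Random Forest trained in about 71 seconds, while SVC took about 96 minutes.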