### Model Notebook
This notebook contains the code for building the machine learning model for flair prediction.


### Importing required Libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn import decomposition, ensemble
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

### Reading the data with pandas `read_csv`

In [3]:
df = pd.read_csv('reddit_data.csv')

### Displaying the top 5 rows of the data

In [4]:
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,author,flair,comment,authors,feature_combine
0,men 30+ decided get married plan old age,257,fvy95j,https://www.reddit.com/r/india/comments/fvy95j...,206,1586207000.0,corona virus given time think life choices bit...,indianoogler,AskIndia,plan finances work enjoy ways healthy go see w...,RedDevil-84 congratsindia khushraho kingof-po...,men 30+ decided get married plan old ageplan f...
1,plan 5th april 9pm switch lights play bella ci...,125,fv8lt3,https://www.reddit.com/r/india/comments/fv8lt3...,82,1586093000.0,dont like worship leaders without logic sympat...,ioup568,AskIndia,made whatsapp forward post anyone interestedpl...,AnonymousSeeker5 TablePrime69 faceofnobody no...,plan 5th april 9pm switch lights play bella ci...
2,got scammed today foolishness give advice im r...,217,ftjihj,https://www.reddit.com/r/india/comments/ftjihj...,75,1585849000.0,info meim 12th class recently registered googl...,Momos-,AskIndia,literally every message every bank says share ...,saadakhtar SerpantSociety BombayCynic iphone4...,got scammed today foolishness give advice im r...
3,serious friends doctor husband become paranoid...,149,fu3f8b,https://www.reddit.com/r/india/comments/fu3f8b...,76,1585925000.0,doctor says 100 sure virus spreading via veget...,wordswithmagic,AskIndia,1 research study 2013 https linkspringercom ar...,Neglectedsince1994 captainobvioushuman adga49...,serious friends doctor husband become paranoid...
4,employer making come office even subtle pressu...,308,fjx0dq,https://www.reddit.com/r/india/comments/fjx0dq...,113,1584439000.0,fill form 100 anonymous take less minute https...,pensker,AskIndia,good luck employer act one two people die lol ...,ekkanpuriya Death_Pig nuvo_reddit goldyprag s...,employer making come office even subtle pressu...


### Python list of the flair tags

In [5]:
flair_tags = ["AskIndia", "Unverified", "Non-Political", 
          "Scheduled", "Photography", "Science/Technology",
          "Politics", "Business/Finance", "Policy/Economy",
          "Sports", "Food", "[R]eddiquette"]

### Replacing all the null values with ""

In [6]:
df.fillna("",inplace = True)


### Trying out different models using different features of the dataset
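
The cells in the following sections repeat the same split, fit, and report steps for each text column. As a minimal sketch of how that repetition could be folded into a loop, the example below uses a small made-up DataFrame (`toy_df`) and a hypothetical helper (`evaluate_column`); neither is part of the notebook. It also passes `stratify=` to `train_test_split`, which the notebook does not do, so that each class keeps the same share in the validation set.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration (not rows from reddit_data.csv).
toy_df = pd.DataFrame({
    "title": [
        "cricket match score", "football team wins match",
        "cricket team captain", "hockey match final score",
        "football league final", "tennis player wins final",
        "spicy curry recipe", "street food biryani",
        "homemade curry dinner", "biryani recipe dinner",
        "sweet dessert recipe", "street food dessert",
    ],
    "body": [
        "team played well", "great goal in the match",
        "captain led the team", "final was a close match",
        "league table updated", "player won in straight sets",
        "recipe uses fresh spices", "best biryani stall in town",
        "cooked curry for dinner", "recipe for a quick dinner",
        "dessert with sugar and milk", "tasty dessert from a stall",
    ],
    "flair": ["Sports"] * 6 + ["Food"] * 6,
})


def evaluate_column(frame, column):
    """Hypothetical helper: train on one text column, return validation accuracy."""
    X_train, X_val, y_train, y_val = train_test_split(
        frame[column], frame["flair"],
        test_size=0.25, random_state=42,
        stratify=frame["flair"],  # assumption: not used in the notebook
    )
    model = Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ])
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))


# One accuracy per feature column instead of one cell per column.
results = {column: evaluate_column(toy_df, column) for column in ["title", "body"]}
print(results)
```

The same loop would work for any of the classifiers by swapping the final pipeline step.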

### Naive Bayes Classifier
I used a Multinomial Naive Bayes classifier as the first model. Because the data is text, I defined a `Pipeline` that converts it with `CountVectorizer` and a TF-IDF transform before fitting the classifier, and trained it separately on each of several features.<br>
I used the following features to train the model:
1. Title
2. URL
3. Body
4. Comment
5. Feature Combine
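
To make the two preprocessing steps concrete, here is a small sketch on made-up documents (not the notebook's data). It checks that `CountVectorizer` followed by `TfidfTransformer` produces the same matrix as scikit-learn's one-step `TfidfVectorizer` with default settings, so either form could be used in the pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Toy documents, made up for illustration.
docs = [
    "plan finances get married",
    "switch lights play song",
    "bank scam advice",
    "employer office pressure advice",
]

# Two-step version: raw token counts, then TF-IDF weighting.
two_step = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
X_two_step = two_step.fit_transform(docs)

# One-step version with default settings.
X_one_step = TfidfVectorizer().fit_transform(docs)

# Rows are documents, columns are distinct tokens in the corpus.
print(X_two_step.shape)
print(np.allclose(X_two_step.toarray(), X_one_step.toarray()))
```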

In [7]:
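def naive_bayes_classifier(X_train, X_val, y_train, y_val):
    """Fit a bag-of-words + TF-IDF + Multinomial Naive Bayes pipeline and print validation metrics."""
    nb_classifier = Pipeline([('vect', CountVectorizer()),
                              ('tfidf', TfidfTransformer()),
                              ('clf', MultinomialNB()),
                             ])
    nb_classifier.fit(X_train, y_train)
    y_pred = nb_classifier.predict(X_val)

    # accuracy_score expects (y_true, y_pred); the value is the same either way.
    print('accuracy %s' % metrics.accuracy_score(y_val, y_pred))
    # Note: target_names are matched to the class labels in sorted order.
    print(classification_report(y_val, y_pred, target_names=flair_tags))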

In [8]:
X = df.title
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state=42)
naive_bayes_classifier(X_train, X_val, y_train, y_val)

accuracy 0.41106719367588934
                    precision    recall  f1-score   support

          AskIndia       0.34      1.00      0.50       148
        Unverified       0.75      0.11      0.19        28
     Non-Political       0.00      0.00      0.00        33
         Scheduled       1.00      0.17      0.29        54
       Photography       1.00      0.23      0.38        30
Science/Technology       0.00      0.00      0.00        37
          Politics       1.00      0.04      0.07        28
  Business/Finance       0.90      0.97      0.94        39
    Policy/Economy       0.00      0.00      0.00        30
            Sports       1.00      0.03      0.07        29
              Food       1.00      0.04      0.07        26
     [R]eddiquette       0.00      0.00      0.00        24

          accuracy                           0.41       506
         macro avg       0.58      0.22      0.21       506
      weighted avg       0.54      0.41      0.29       506
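
One thing worth checking when reading reports like the one above: when only `target_names` is given, `classification_report` computes its rows for the class labels in sorted order and then prints the names in the order supplied. If the name list is not sorted the same way as the labels, the row names can be attached to the wrong classes. The sketch below uses made-up labels (`display_order`, `y_true`, `y_pred` are illustrative, not from the notebook) and shows that also passing `labels=` keeps names and rows aligned. Whether this affects the notebook depends on the actual strings stored in `df.flair`, which is an assumption here.

```python
from sklearn.metrics import classification_report

# Toy labels, made up for illustration. Note the list is not alphabetical.
display_order = ["Sports", "Food", "AskIndia"]
y_true = ["Food", "Food", "Sports", "Sports", "Sports", "AskIndia"]
y_pred = ["Food", "Food", "Sports", "Sports", "Sports", "AskIndia"]

# Without `labels`, rows follow sorted label order (AskIndia, Food, Sports)
# but are printed under the names in `display_order`, so the row called
# "Sports" actually describes AskIndia.
misaligned = classification_report(
    y_true, y_pred, target_names=display_order, output_dict=True)

# Passing `labels` in the same order keeps rows and names in step.
aligned = classification_report(
    y_true, y_pred, labels=display_order, target_names=display_order,
    output_dict=True)

print("misaligned 'Sports' support:", misaligned["Sports"]["support"])
print("aligned 'Sports' support:", aligned["Sports"]["support"])
```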



In [9]:
X = df.url
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state=42)
naive_bayes_classifier(X_train, X_val, y_train, y_val)

accuracy 0.4624505928853755
                    precision    recall  f1-score   support

          AskIndia       0.46      1.00      0.63       148
        Unverified       0.48      0.39      0.43        28
     Non-Political       0.00      0.00      0.00        33
         Scheduled       0.44      0.44      0.44        54
       Photography       0.38      0.90      0.53        30
Science/Technology       0.50      0.05      0.10        37
          Politics       0.33      0.04      0.06        28
  Business/Finance       1.00      0.05      0.10        39
    Policy/Economy       1.00      0.13      0.24        30
            Sports       0.79      0.52      0.62        29
              Food       0.00      0.00      0.00        26
     [R]eddiquette       0.00      0.00      0.00        24

          accuracy                           0.46       506
         macro avg       0.45      0.29      0.26       506
      weighted avg       0.47      0.46      0.35       506



In [10]:
X = df.body
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state=42)
naive_bayes_classifier(X_train, X_val, y_train, y_val)

accuracy 0.3695652173913043
                    precision    recall  f1-score   support

          AskIndia       0.32      1.00      0.48       148
        Unverified       0.00      0.00      0.00        28
     Non-Political       0.00      0.00      0.00        33
         Scheduled       0.00      0.00      0.00        54
       Photography       0.00      0.00      0.00        30
Science/Technology       0.00      0.00      0.00        37
          Politics       0.00      0.00      0.00        28
  Business/Finance       0.95      0.97      0.96        39
    Policy/Economy       0.00      0.00      0.00        30
            Sports       0.00      0.00      0.00        29
              Food       0.00      0.00      0.00        26
     [R]eddiquette       1.00      0.04      0.08        24

          accuracy                           0.37       506
         macro avg       0.19      0.17      0.13       506
      weighted avg       0.21      0.37      0.22       506



In [11]:
X = df.comment
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state=42)
naive_bayes_classifier(X_train, X_val, y_train, y_val)

accuracy 0.3142292490118577
                    precision    recall  f1-score   support

          AskIndia       0.30      1.00      0.46       148
        Unverified       0.00      0.00      0.00        28
     Non-Political       0.00      0.00      0.00        33
         Scheduled       0.00      0.00      0.00        54
       Photography       1.00      0.03      0.06        30
Science/Technology       0.00      0.00      0.00        37
          Politics       0.00      0.00      0.00        28
  Business/Finance       0.91      0.26      0.40        39
    Policy/Economy       0.00      0.00      0.00        30
            Sports       0.00      0.00      0.00        29
              Food       0.00      0.00      0.00        26
     [R]eddiquette       0.00      0.00      0.00        24

          accuracy                           0.31       506
         macro avg       0.18      0.11      0.08       506
      weighted avg       0.22      0.31      0.17       506



In [12]:
X = df.feature_combine
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state=42)
naive_bayes_classifier(X_train, X_val, y_train, y_val)

accuracy 0.3201581027667984
                    precision    recall  f1-score   support

          AskIndia       0.30      1.00      0.46       148
        Unverified       0.00      0.00      0.00        28
     Non-Political       0.00      0.00      0.00        33
         Scheduled       0.00      0.00      0.00        54
       Photography       1.00      0.03      0.06        30
Science/Technology       0.00      0.00      0.00        37
          Politics       0.00      0.00      0.00        28
  Business/Finance       1.00      0.33      0.50        39
    Policy/Economy       0.00      0.00      0.00        30
            Sports       0.00      0.00      0.00        29
              Food       0.00      0.00      0.00        26
     [R]eddiquette       0.00      0.00      0.00        24

          accuracy                           0.32       506
         macro avg       0.19      0.11      0.09       506
      weighted avg       0.22      0.32      0.18       506



### Logistic Regression Classifier
I used Logistic Regression as the second model. Because the data is text, the `Pipeline` again converts it with `CountVectorizer` and a TF-IDF transform before fitting the classifier, and the model is trained separately on each feature. I tried different values of `C` and of the maximum number of iterations; the best accuracy came at `C = 1e6` with a maximum of 200 iterations. Note that these runs hold out 10% of the data (`test_size = 0.1`), whereas the Naive Bayes runs held out 20%, so the accuracies are not measured on the same validation set.<br>
I used the following features to train the model:
1. Title
2. URL
3. Body
4. Comment
5. Feature Combine
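
The search over `C` is not shown in the notebook. As a rough sketch of how such a comparison could be run, the example below scores a few `C` values with cross-validation on a made-up corpus. The candidate values, the toy texts, and the use of `cross_val_score` are all assumptions for illustration, not the procedure actually used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration (not the notebook's data).
texts = [
    "cricket match score", "football team wins match",
    "cricket team captain", "hockey match final score",
    "football league final", "tennis player wins final",
    "spicy curry recipe", "street food biryani",
    "homemade curry dinner", "biryani recipe dinner",
    "sweet dessert recipe", "street food dessert",
]
labels = ["Sports"] * 6 + ["Food"] * 6

# Candidate regularisation strengths; larger C means weaker regularisation.
c_values = [0.01, 1.0, 100.0, 1e6]
mean_scores = {}
for c in c_values:
    model = Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", LogisticRegression(C=c, max_iter=200)),
    ])
    # 3-fold cross-validated accuracy instead of a single held-out split.
    scores = cross_val_score(model, texts, labels, cv=3)
    mean_scores[c] = scores.mean()

best_c = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best C:", best_c)
```

Averaging over folds makes the choice of `C` less sensitive to which rows happen to land in one validation split.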

In [13]:
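def logisticreg_classifier(X_train, X_val, y_train, y_val):
    """Fit a bag-of-words + TF-IDF + Logistic Regression pipeline and print validation metrics."""
    logreg_classifier = Pipeline([('vect', CountVectorizer()),
                                  ('tfidf', TfidfTransformer()),
                                  # Large C means very weak regularisation.
                                  ('clf', LogisticRegression(n_jobs=1, max_iter=200, C=1e6))
                                 ])
    logreg_classifier.fit(X_train, y_train)

    y_pred = logreg_classifier.predict(X_val)

    # accuracy_score expects (y_true, y_pred); the value is the same either way.
    print('accuracy %s' % metrics.accuracy_score(y_val, y_pred))
    print(classification_report(y_val, y_pred, target_names=flair_tags))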

In [14]:
X = df.title
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
logisticreg_classifier(X_train, X_val, y_train, y_val)

accuracy 0.6798418972332015
                    precision    recall  f1-score   support

          AskIndia       0.72      1.00      0.84        65
        Unverified       0.50      0.60      0.55        15
     Non-Political       0.33      0.40      0.36        15
         Scheduled       0.76      0.85      0.80        26
       Photography       0.67      0.71      0.69        14
Science/Technology       0.50      0.11      0.18        18
          Politics       0.55      0.35      0.43        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.67      0.50      0.57        16
            Sports       0.82      0.78      0.80        18
              Food       0.50      0.28      0.36        18
     [R]eddiquette       0.62      0.45      0.53        11

          accuracy                           0.68       253
         macro avg       0.63      0.59      0.59       253
      weighted avg       0.66      0.68      0.65       253



In [15]:
X = df.url
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
logisticreg_classifier(X_train, X_val, y_train, y_val)

accuracy 0.5691699604743083
                    precision    recall  f1-score   support

          AskIndia       0.60      1.00      0.75        65
        Unverified       0.43      0.40      0.41        15
     Non-Political       0.00      0.00      0.00        15
         Scheduled       0.92      0.85      0.88        26
       Photography       0.32      1.00      0.48        14
Science/Technology       0.18      0.11      0.14        18
          Politics       0.50      0.12      0.19        17
  Business/Finance       0.94      0.75      0.83        20
    Policy/Economy       0.30      0.19      0.23        16
            Sports       0.71      0.67      0.69        18
              Food       0.00      0.00      0.00        18
     [R]eddiquette       0.75      0.27      0.40        11

          accuracy                           0.57       253
         macro avg       0.47      0.45      0.42       253
      weighted avg       0.51      0.57      0.50       253



In [16]:
X = df.comment
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
logisticreg_classifier(X_train, X_val, y_train, y_val)

accuracy 0.6679841897233202
                    precision    recall  f1-score   support

          AskIndia       0.83      0.97      0.89        65
        Unverified       0.62      0.53      0.57        15
     Non-Political       0.62      0.53      0.57        15
         Scheduled       0.70      0.81      0.75        26
       Photography       0.89      0.57      0.70        14
Science/Technology       0.33      0.22      0.27        18
          Politics       0.57      0.47      0.52        17
  Business/Finance       0.87      1.00      0.93        20
    Policy/Economy       1.00      0.19      0.32        16
            Sports       0.59      0.72      0.65        18
              Food       0.38      0.50      0.43        18
     [R]eddiquette       0.29      0.36      0.32        11

          accuracy                           0.67       253
         macro avg       0.64      0.57      0.58       253
      weighted avg       0.68      0.67      0.65       253



In [17]:
X = df.body
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
logisticreg_classifier(X_train, X_val, y_train, y_val)

accuracy 0.48221343873517786
                    precision    recall  f1-score   support

          AskIndia       0.85      0.94      0.89        65
        Unverified       1.00      0.07      0.12        15
     Non-Political       0.83      0.33      0.48        15
         Scheduled       0.20      1.00      0.33        26
       Photography       0.00      0.00      0.00        14
Science/Technology       0.62      0.28      0.38        18
          Politics       1.00      0.06      0.11        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.00      0.00      0.00        16
            Sports       0.00      0.00      0.00        18
              Food       0.14      0.06      0.08        18
     [R]eddiquette       0.50      0.18      0.27        11

          accuracy                           0.48       253
         macro avg       0.51      0.33      0.30       253
      weighted avg       0.57      0.48      0.43       253



In [18]:
X = df.feature_combine
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
logisticreg_classifier(X_train, X_val, y_train, y_val)

accuracy 0.7272727272727273
                    precision    recall  f1-score   support

          AskIndia       0.79      0.97      0.87        65
        Unverified       0.53      0.53      0.53        15
     Non-Political       0.69      0.60      0.64        15
         Scheduled       0.75      0.81      0.78        26
       Photography       0.80      0.86      0.83        14
Science/Technology       0.36      0.28      0.31        18
          Politics       0.64      0.53      0.58        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.70      0.44      0.54        16
            Sports       0.95      1.00      0.97        18
              Food       0.54      0.39      0.45        18
     [R]eddiquette       0.45      0.45      0.45        11

          accuracy                           0.73       253
         macro avg       0.68      0.65      0.66       253
      weighted avg       0.71      0.73      0.71       253



### Linear SVM (Support Vector Machine) Classifier
I used a linear SVM as the third model, implemented with `SGDClassifier` and hinge loss. Because the data is text, the `Pipeline` converts it with `CountVectorizer` and a TF-IDF transform before fitting the classifier, and the model is trained separately on each feature. I tried different loss functions and `alpha` values; the values set in the function gave the best accuracy.<br>
I used the following features to train the model:
1. Title
2. URL
3. Body
4. Comment
5. Feature Combine
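
The notebook does not show how the loss function and `alpha` were varied. One possible way to run that comparison is a grid search over the pipeline's classifier step, sketched below on a made-up corpus. The grid values and the toy texts are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration (not the notebook's data).
texts = [
    "cricket match score", "football team wins match",
    "cricket team captain", "hockey match final score",
    "football league final", "tennis player wins final",
    "spicy curry recipe", "street food biryani",
    "homemade curry dinner", "biryani recipe dinner",
    "sweet dessert recipe", "street food dessert",
]
labels = ["Sports"] * 6 + ["Food"] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(penalty="l2", random_state=42,
                          max_iter=100, tol=None)),
])

# "step__parameter" names address parameters of a step inside the pipeline.
param_grid = {
    "clf__loss": ["hinge", "modified_huber"],
    "clf__alpha": [1e-5, 1e-4, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```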

In [19]:
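def linear_svm_classifier(X_train, X_val, y_train, y_val):
    """Fit a bag-of-words + TF-IDF + linear SVM pipeline and print validation metrics."""
    # SGDClassifier with hinge loss trains a linear SVM by stochastic gradient descent.
    sgd_classifier = Pipeline([('vect', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('clf', linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-5, random_state=42, max_iter=100, tol=None)),
                              ])
    sgd_classifier.fit(X_train, y_train)
    y_pred = sgd_classifier.predict(X_val)

    # accuracy_score expects (y_true, y_pred); the value is the same either way.
    print('accuracy %s' % metrics.accuracy_score(y_val, y_pred))
    print(classification_report(y_val, y_pred, target_names=flair_tags))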

In [20]:
X = df.title
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
linear_svm_classifier(X_train, X_val, y_train, y_val)

accuracy 0.6482213438735178
                    precision    recall  f1-score   support

          AskIndia       0.76      1.00      0.86        65
        Unverified       0.35      0.40      0.38        15
     Non-Political       0.50      0.40      0.44        15
         Scheduled       0.63      0.85      0.72        26
       Photography       0.67      0.71      0.69        14
Science/Technology       0.30      0.17      0.21        18
          Politics       0.62      0.29      0.40        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.55      0.38      0.44        16
            Sports       0.64      0.78      0.70        18
              Food       0.31      0.22      0.26        18
     [R]eddiquette       1.00      0.27      0.43        11

          accuracy                           0.65       253
         macro avg       0.61      0.54      0.54       253
      weighted avg       0.63      0.65      0.62       253



In [21]:
X = df.url
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
linear_svm_classifier(X_train, X_val, y_train, y_val)

accuracy 0.5652173913043478
                    precision    recall  f1-score   support

          AskIndia       0.60      1.00      0.75        65
        Unverified       0.40      0.40      0.40        15
     Non-Political       0.00      0.00      0.00        15
         Scheduled       0.85      0.85      0.85        26
       Photography       0.32      0.93      0.47        14
Science/Technology       0.18      0.11      0.14        18
          Politics       0.33      0.06      0.10        17
  Business/Finance       0.94      0.75      0.83        20
    Policy/Economy       0.43      0.19      0.26        16
            Sports       0.71      0.67      0.69        18
              Food       0.25      0.06      0.09        18
     [R]eddiquette       0.60      0.27      0.37        11

          accuracy                           0.57       253
         macro avg       0.47      0.44      0.41       253
      weighted avg       0.51      0.57      0.50       253



In [22]:
X = df.comment
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
linear_svm_classifier(X_train, X_val, y_train, y_val)

accuracy 0.6521739130434783
                    precision    recall  f1-score   support

          AskIndia       0.74      0.97      0.84        65
        Unverified       0.36      0.27      0.31        15
     Non-Political       0.69      0.60      0.64        15
         Scheduled       0.81      0.81      0.81        26
       Photography       0.75      0.64      0.69        14
Science/Technology       0.26      0.28      0.27        18
          Politics       0.56      0.53      0.55        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       1.00      0.19      0.32        16
            Sports       0.64      0.89      0.74        18
              Food       0.22      0.11      0.15        18
     [R]eddiquette       0.31      0.36      0.33        11

          accuracy                           0.65       253
         macro avg       0.61      0.55      0.55       253
      weighted avg       0.65      0.65      0.62       253



In [23]:
X = df.body
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
linear_svm_classifier(X_train, X_val, y_train, y_val)

accuracy 0.4743083003952569
                    precision    recall  f1-score   support

          AskIndia       0.82      0.94      0.88        65
        Unverified       0.50      0.07      0.12        15
     Non-Political       0.83      0.33      0.48        15
         Scheduled       1.00      0.19      0.32        26
       Photography       0.00      0.00      0.00        14
Science/Technology       0.80      0.22      0.35        18
          Politics       1.00      0.06      0.11        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.00      0.00      0.00        16
            Sports       0.14      1.00      0.25        18
              Food       0.33      0.17      0.22        18
     [R]eddiquette       0.40      0.18      0.25        11

          accuracy                           0.47       253
         macro avg       0.57      0.35      0.33       253
      weighted avg       0.64      0.47      0.45       253



In [24]:
X = df.feature_combine
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
linear_svm_classifier(X_train, X_val, y_train, y_val)

accuracy 0.7312252964426877
                    precision    recall  f1-score   support

          AskIndia       0.83      0.97      0.89        65
        Unverified       0.53      0.60      0.56        15
     Non-Political       0.69      0.60      0.64        15
         Scheduled       0.73      0.85      0.79        26
       Photography       0.71      0.86      0.77        14
Science/Technology       0.43      0.33      0.38        18
          Politics       0.67      0.47      0.55        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.67      0.38      0.48        16
            Sports       0.90      1.00      0.95        18
              Food       0.53      0.44      0.48        18
     [R]eddiquette       0.44      0.36      0.40        11

          accuracy                           0.73       253
         macro avg       0.67      0.65      0.66       253
      weighted avg       0.71      0.73      0.72       253



### Random Forest Classifier
I used a Random Forest classifier as the fourth model. Because the data is text, the `Pipeline` converts it with `CountVectorizer` and a TF-IDF transform before fitting the classifier, and the model is trained separately on each feature.<br>
I used the following features to train the model:
1. Title
2. URL
3. Body
4. Comment
5. Feature Combine
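
A fitted random forest also exposes per-feature importances, which can hint at the tokens the model relies on. The sketch below is not part of the notebook: it fits a small forest on a made-up corpus and maps the largest importances back to tokens through the vectorizer's `vocabulary_`. The corpus and the smaller `n_estimators` value are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration (not the notebook's data).
texts = [
    "cricket match score", "football team wins match",
    "cricket team captain", "hockey match final score",
    "football league final", "tennis player wins final",
    "spicy curry recipe", "street food biryani",
    "homemade curry dinner", "biryani recipe dinner",
    "sweet dessert recipe", "street food dessert",
]
labels = ["Sports"] * 6 + ["Food"] * 6

forest = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
forest.fit(texts, labels)

# vocabulary_ maps token -> column index; invert it to look tokens up by column.
vocabulary = forest.named_steps["vect"].vocabulary_
index_to_token = {int(index): token for token, index in vocabulary.items()}

# One importance value per column of the TF-IDF matrix.
importances = forest.named_steps["clf"].feature_importances_
top_indices = np.argsort(importances)[::-1][:5]
top_tokens = [(index_to_token[int(i)], float(importances[i])) for i in top_indices]

for token, weight in top_tokens:
    print(f"{token}: {weight:.3f}")
```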

In [25]:
def randomforest_classifier(X_train, X_val, y_train, y_val):
    """Fit a bag-of-words + TF-IDF + Random Forest pipeline and print validation metrics."""
    randomforest_clf = Pipeline([('vect', CountVectorizer()),
                                 ('tfidf', TfidfTransformer()),
                                 ('clf', ensemble.RandomForestClassifier(n_estimators=1000, random_state=42)),
                                ])
    randomforest_clf.fit(X_train, y_train)

    y_pred = randomforest_clf.predict(X_val)

    # accuracy_score expects (y_true, y_pred); the value is the same either way.
    print('accuracy %s' % metrics.accuracy_score(y_val, y_pred))
    print(classification_report(y_val, y_pred, target_names=flair_tags))

In [26]:
X = df.title
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
randomforest_classifier(X_train, X_val, y_train, y_val)

accuracy 0.6086956521739131
                    precision    recall  f1-score   support

          AskIndia       0.61      0.97      0.75        65
        Unverified       0.43      0.40      0.41        15
     Non-Political       0.71      0.33      0.45        15
         Scheduled       0.49      0.88      0.63        26
       Photography       0.62      0.71      0.67        14
Science/Technology       0.80      0.22      0.35        18
          Politics       0.50      0.18      0.26        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.56      0.31      0.40        16
            Sports       1.00      0.61      0.76        18
              Food       0.43      0.17      0.24        18
     [R]eddiquette       0.14      0.09      0.11        11

          accuracy                           0.61       253
         macro avg       0.60      0.49      0.50       253
      weighted avg       0.62      0.61      0.57       253



In [27]:
X = df.url
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
randomforest_classifier(X_train, X_val, y_train, y_val)

accuracy 0.5296442687747036
                    precision    recall  f1-score   support

          AskIndia       0.70      0.97      0.81        65
        Unverified       0.42      0.33      0.37        15
     Non-Political       0.00      0.00      0.00        15
         Scheduled       0.49      0.85      0.62        26
       Photography       0.00      0.00      0.00        14
Science/Technology       0.20      0.06      0.09        18
          Politics       0.00      0.00      0.00        17
  Business/Finance       0.33      1.00      0.49        20
    Policy/Economy       0.44      0.44      0.44        16
            Sports       0.74      0.78      0.76        18
              Food       0.00      0.00      0.00        18
     [R]eddiquette       1.00      0.18      0.31        11

          accuracy                           0.53       253
         macro avg       0.36      0.38      0.32       253
      weighted avg       0.42      0.53      0.43       253



In [28]:
X = df.comment
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
randomforest_classifier(X_train, X_val, y_train, y_val)

accuracy 0.5177865612648221
                    precision    recall  f1-score   support

          AskIndia       0.43      1.00      0.60        65
        Unverified       0.71      0.33      0.45        15
     Non-Political       0.75      0.20      0.32        15
         Scheduled       1.00      0.58      0.73        26
       Photography       0.62      0.36      0.45        14
Science/Technology       1.00      0.06      0.11        18
          Politics       1.00      0.06      0.11        17
  Business/Finance       0.87      1.00      0.93        20
    Policy/Economy       0.67      0.12      0.21        16
            Sports       0.32      0.67      0.44        18
              Food       1.00      0.06      0.11        18
     [R]eddiquette       1.00      0.09      0.17        11

          accuracy                           0.52       253
         macro avg       0.78      0.38      0.39       253
      weighted avg       0.72      0.52      0.45       253



In [29]:
X = df.body
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
randomforest_classifier(X_train, X_val, y_train, y_val)

accuracy 0.43478260869565216
                    precision    recall  f1-score   support

          AskIndia       0.64      0.97      0.77        65
        Unverified       0.00      0.00      0.00        15
     Non-Political       0.00      0.00      0.00        15
         Scheduled       0.19      0.96      0.32        26
       Photography       0.00      0.00      0.00        14
Science/Technology       1.00      0.06      0.11        18
          Politics       0.00      0.00      0.00        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.00      0.00      0.00        16
            Sports       0.00      0.00      0.00        18
              Food       0.00      0.00      0.00        18
     [R]eddiquette       0.50      0.09      0.15        11

          accuracy                           0.43       253
         macro avg       0.27      0.26      0.19       253
      weighted avg       0.35      0.43      0.32       253



In [30]:
X = df.feature_combine
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
randomforest_classifier(X_train, X_val, y_train, y_val)

accuracy 0.5849802371541502
                    precision    recall  f1-score   support

          AskIndia       0.51      1.00      0.67        65
        Unverified       0.50      0.33      0.40        15
     Non-Political       0.40      0.13      0.20        15
         Scheduled       0.86      0.69      0.77        26
       Photography       0.46      0.93      0.62        14
Science/Technology       0.33      0.06      0.10        18
          Politics       1.00      0.24      0.38        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       1.00      0.25      0.40        16
            Sports       0.56      0.83      0.67        18
              Food       0.00      0.00      0.00        18
     [R]eddiquette       0.50      0.09      0.15        11

          accuracy                           0.58       253
         macro avg       0.59      0.46      0.44       253
      weighted avg       0.59      0.58      0.51       253



### MLP Classifier
I used an MLP (multi-layer perceptron) classifier as the fifth model. Because the data is text, the `Pipeline` converts it with `CountVectorizer` and a TF-IDF transform before fitting the classifier, and the model is trained separately on each feature. I tried different hidden-layer sizes and activation functions.<br>
I used the following features to train the model:
1. Title
2. URL
3. Body
4. Comment
5. Feature Combine
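
`MLPClassifier` initialises its weights randomly, and when no `random_state` is passed the results can differ from run to run. The sketch below is an illustration on a made-up corpus, not the notebook's code. It shows that fixing `random_state` (an addition not present in the notebook) makes two independent fits give the same predictions; `build_mlp` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration (not the notebook's data).
texts = [
    "cricket match score", "football team wins match",
    "cricket team captain", "hockey match final score",
    "football league final", "tennis player wins final",
    "spicy curry recipe", "street food biryani",
    "homemade curry dinner", "biryani recipe dinner",
    "sweet dessert recipe", "street food dessert",
]
labels = ["Sports"] * 6 + ["Food"] * 6
new_texts = ["cricket final score", "curry recipe", "football match"]


def build_mlp(seed):
    """Hypothetical helper: same architecture as the notebook, plus a fixed seed."""
    return Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MLPClassifier(hidden_layer_sizes=(30, 40, 30),
                              activation="relu", max_iter=200,
                              random_state=seed)),
    ])


# Two independent fits with the same seed.
run_a = build_mlp(0).fit(texts, labels).predict(new_texts)
run_b = build_mlp(0).fit(texts, labels).predict(new_texts)

print(run_a)
print((run_a == run_b).all())
```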

In [31]:
from sklearn.neural_network import MLPClassifier

def mlp_classifier(X_train, X_val, y_train, y_val):
    """Fit a bag-of-words + TF-IDF + MLP pipeline and print validation metrics."""
    # Named mlp_clf so the local pipeline does not shadow the function's own name.
    mlp_clf = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('clf', MLPClassifier(hidden_layer_sizes=(30, 40, 30), activation="relu", max_iter=200)),
                       ])

    mlp_clf.fit(X_train, y_train)

    y_pred = mlp_clf.predict(X_val)

    # accuracy_score expects (y_true, y_pred); the value is the same either way.
    print('accuracy %s' % metrics.accuracy_score(y_val, y_pred))
    print(classification_report(y_val, y_pred, target_names=flair_tags))

In [32]:
X = df.title
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
mlp_classifier(X_train, X_val, y_train, y_val)

accuracy 0.6442687747035574
                    precision    recall  f1-score   support

          AskIndia       0.80      1.00      0.89        65
        Unverified       0.39      0.47      0.42        15
     Non-Political       0.43      0.60      0.50        15
         Scheduled       0.81      0.85      0.83        26
       Photography       0.75      0.64      0.69        14
Science/Technology       0.33      0.28      0.30        18
          Politics       0.43      0.18      0.25        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.47      0.44      0.45        16
            Sports       0.83      0.56      0.67        18
              Food       0.18      0.17      0.17        18
     [R]eddiquette       0.43      0.27      0.33        11

          accuracy                           0.64       253
         macro avg       0.57      0.54      0.54       253
      weighted avg       0.63      0.64      0.63       253



In [33]:
X = df.url
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
mlp_classifier(X_train, X_val, y_train, y_val)

accuracy 0.5612648221343873
                    precision    recall  f1-score   support

          AskIndia       1.00      0.97      0.98        65
        Unverified       0.43      0.40      0.41        15
     Non-Political       0.00      0.00      0.00        15
         Scheduled       0.91      0.81      0.86        26
       Photography       0.30      0.93      0.46        14
Science/Technology       0.27      0.17      0.21        18
          Politics       0.33      0.18      0.23        17
  Business/Finance       0.94      0.75      0.83        20
    Policy/Economy       1.00      0.06      0.12        16
            Sports       0.75      0.67      0.71        18
              Food       0.00      0.00      0.00        18
     [R]eddiquette       0.10      0.45      0.17        11

          accuracy                           0.56       253
         macro avg       0.50      0.45      0.41       253
      weighted avg       0.63      0.56      0.55       253



In [34]:
X = df.comment
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
mlp_classifier(X_train, X_val, y_train, y_val)

accuracy 0.6086956521739131
                    precision    recall  f1-score   support

          AskIndia       0.80      0.97      0.88        65
        Unverified       0.29      0.13      0.18        15
     Non-Political       0.56      0.60      0.58        15
         Scheduled       0.88      0.81      0.84        26
       Photography       0.50      0.57      0.53        14
Science/Technology       0.44      0.39      0.41        18
          Politics       0.56      0.53      0.55        17
  Business/Finance       1.00      0.65      0.79        20
    Policy/Economy       0.67      0.12      0.21        16
            Sports       0.50      0.61      0.55        18
              Food       0.18      0.39      0.25        18
     [R]eddiquette       0.67      0.18      0.29        11

          accuracy                           0.61       253
         macro avg       0.59      0.50      0.50       253
      weighted avg       0.64      0.61      0.60       253



In [35]:
X = df.body
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
mlp_classifier(X_train, X_val, y_train, y_val)

accuracy 0.4782608695652174
                    precision    recall  f1-score   support

          AskIndia       0.92      0.94      0.93        65
        Unverified       0.00      0.00      0.00        15
     Non-Political       0.67      0.40      0.50        15
         Scheduled       0.19      0.96      0.31        26
       Photography       0.00      0.00      0.00        14
Science/Technology       0.50      0.17      0.25        18
          Politics       0.33      0.06      0.10        17
  Business/Finance       0.95      1.00      0.98        20
    Policy/Economy       0.00      0.00      0.00        16
            Sports       0.00      0.00      0.00        18
              Food       0.44      0.22      0.30        18
     [R]eddiquette       1.00      0.09      0.17        11

          accuracy                           0.48       253
         macro avg       0.42      0.32      0.29       253
      weighted avg       0.50      0.48      0.43       253



In [36]:
X = df.feature_combine
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)
mlp_classifier(X_train, X_val, y_train, y_val)

accuracy 0.6482213438735178
                    precision    recall  f1-score   support

          AskIndia       0.97      0.97      0.97        65
        Unverified       0.38      0.33      0.36        15
     Non-Political       0.53      0.60      0.56        15
         Scheduled       0.95      0.81      0.88        26
       Photography       0.83      0.71      0.77        14
Science/Technology       0.24      0.50      0.33        18
          Politics       1.00      0.06      0.11        17
  Business/Finance       0.90      0.95      0.93        20
    Policy/Economy       0.33      0.12      0.18        16
            Sports       0.45      0.94      0.61        18
              Food       0.33      0.11      0.17        18
     [R]eddiquette       0.40      0.55      0.46        11

          accuracy                           0.65       253
         macro avg       0.61      0.55      0.53       253
      weighted avg       0.70      0.65      0.63       253
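
The five cells above repeat the same split, fit, and evaluate steps for each feature column. The sketch below shows how that comparison could be written as one loop that collects the accuracy per column. The `toy_df` frame and its two text columns are hypothetical stand-ins for `df`, and the numbers it prints say nothing about the results reported above.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in for the notebook's df (hypothetical data).
toy_df = pd.DataFrame({
    "title": ["cricket match win", "budget tax policy"] * 20,
    "body": ["team scored runs today", "government announced new rates"] * 20,
    "flair": ["Sports", "Policy/Economy"] * 20,
})

feature_columns = ["title", "body"]
accuracy_by_feature = {}

for column in feature_columns:
    # Same split settings as the notebook cells: 10% validation, fixed seed.
    X_train, X_val, y_train, y_val = train_test_split(
        toy_df[column], toy_df["flair"], test_size=0.1, random_state=42)

    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MLPClassifier(hidden_layer_sizes=(30, 40, 30),
                              activation="relu",
                              max_iter=200,
                              random_state=42)),
    ])
    pipeline.fit(X_train, y_train)
    accuracy_by_feature[column] = metrics.accuracy_score(
        y_val, pipeline.predict(X_val))

for column, accuracy in accuracy_by_feature.items():
    print(column, round(accuracy, 3))
```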



The code below imports the pickle library and saves the best model as a pickle object.<br>
The best model is the SVM classifier (a linear SVM trained with SGDClassifier and hinge loss), with an accuracy of 73.12% on the feature_combine feature of the dataset.

In [42]:
import pickle


In [43]:
X = df.feature_combine
y = df.flair
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state=42)

In [44]:
sgd_classifier = Pipeline([('vect', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('clf', linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-5, random_state=42, max_iter=100, tol=None)),
                              ])
sgd_classifier.fit(X_train, y_train)
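# Use a context manager so the file is closed once the model has been written
with open("sgdClf.pkl", 'wb') as model_file:
    pickle.dump(sgd_classifier, model_file)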
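
A saved pipeline can be loaded back and used for prediction without refitting. The sketch below illustrates the round trip on toy data. The texts, labels, the temporary file location, and the `new_posts` examples are hypothetical, and only the pipeline settings mirror the cell above.

```python
import os
import pickle
import tempfile

from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical); the notebook fits on df.feature_combine and df.flair.
texts = ["cricket match win", "team scores goal",
         "budget tax policy", "bank rate economy"] * 5
labels = ["Sports", "Sports", "Policy/Economy", "Policy/Economy"] * 5

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-5,
                                       random_state=42, max_iter=100, tol=None)),
])
pipeline.fit(texts, labels)

# Write the fitted pipeline to a temporary directory so the example leaves no file behind
# in the working directory.
model_path = os.path.join(tempfile.mkdtemp(), "sgdClf.pkl")
with open(model_path, "wb") as model_file:
    pickle.dump(pipeline, model_file)

# Load it back; the vectorizer vocabulary and classifier weights come with it.
with open(model_path, "rb") as model_file:
    loaded_model = pickle.load(model_file)

new_posts = ["cricket team win", "tax budget economy"]
original_predictions = pipeline.predict(new_posts)
loaded_predictions = loaded_model.predict(new_posts)
print(loaded_predictions)
```

Because the whole Pipeline is pickled, raw text can be passed straight to `predict` after loading. A pickle file should be loaded with the same scikit-learn version that wrote it, and only from a trusted source.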