<h1>3 Flair Detector </h1>
<br>
The Logistic Regression classification algorithm is easily generalized to multiple classes. 

<p> Reading data from the CSV file and importing some necessary packages. </p>

In [12]:
import pandas as pd
import numpy as np
import os
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer

In [13]:
india_data = pd.read_csv("reddit-india-data.csv")

In [14]:
flairs = india_data.flair.unique()

<b>3.1 Text Pre-Processing:</b> For our data set from reddit, the text cleaning step includes:
    a) remove stop words, 
    b) change text to lower case, 
    c) remove punctuation, 
    d) remove bad characters
 

<p>Using the regular expression module, we describe the symbols that need replacement with space and need to be removed. This is done with the sub() function in the re module. In stop_words, we store the stopwords list taken from the natural language toolkit (nltk). </p>

In [15]:
replace_by_space = re.compile('[/(){}\[\]\|@,;]')
bad_symbols = re.compile('[^0-9a-z #+_]')
stop_words = stopwords.words('english')

def clean_data(text):
    #converting to lowercase
    text = text.lower()
    #re.sub(new_value, text_to_processed) 
    text = replace_by_space.sub(' ', text)
    text = bad_symbols.sub('', text)
    #removing the stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words) 
    return text

<p><b>3.2</b> The url has certain extra cleaning steps involved. The reason for these steps was discovered during the data analysis.  </p>

In [16]:
def clean_url(u):
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

In [17]:
def reddit_url(u):
    u = u.replace('redditcom', '')
    u = u.replace('r', '')
    u = u.replace('india', '')
    u = u.replace('comments','')
    for word in u:
        u = ' '.join(u.split('_'))
    return u

<p><b>3.3</b> Cleaning all the values that will be taken into consideration for the classifier. </p>

In [18]:
india_data['title'] = india_data['title'].apply(clean_data)
india_data['comments'] = india_data['comments'].astype('str').apply(clean_data)
india_data['body'] = india_data['body'].astype('str').apply(clean_data)

In [19]:
india_data['url'] = india_data['url'].apply(clean_url)
india_data['url'] = india_data['url'].apply(clean_data)

In [20]:
india_data['url'] = india_data['url'].apply(reddit_url)
india_data['url'] = india_data['url'].apply(clean_data)

<h3> Logistic Regression Model (Title) </h3>

In [23]:
X = india_data.title
y = india_data.flair
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.7
                  precision    recall  f1-score   support

   Non-Political       0.78      0.74      0.76        19
       Scheduled       0.39      0.64      0.49        14
     Photography       0.92      0.92      0.92        24
        Politics       0.83      0.75      0.79        20
Business/Finance       0.96      1.00      0.98        24
  Policy/Economy       0.87      0.68      0.76        19
          Sports       0.53      0.42      0.47        24
            Food       0.53      0.59      0.56        17
        AskIndia       0.55      0.55      0.55        20
     Coronavirus       0.63      0.63      0.63        19

        accuracy                           0.70       200
       macro avg       0.70      0.69      0.69       200
    weighted avg       0.72      0.70      0.70       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments) </h3>

In [24]:
india_data['title_comments'] = india_data['title'] + ' ' + india_data['comments']

In [25]:
X = india_data.title_comments
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.71
                  precision    recall  f1-score   support

   Non-Political       0.42      0.53      0.47        19
       Scheduled       0.79      0.50      0.61        22
     Photography       0.82      0.86      0.84        21
        Politics       0.78      0.82      0.80        17
Business/Finance       0.67      0.71      0.69        17
  Policy/Economy       0.91      0.95      0.93        21
          Sports       0.76      0.64      0.70        25
            Food       0.52      0.76      0.62        17
        AskIndia       0.72      0.57      0.63        23
     Coronavirus       0.83      0.83      0.83        18

        accuracy                           0.71       200
       macro avg       0.72      0.72      0.71       200
    weighted avg       0.73      0.71      0.71       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments + Body) </h3>

In [26]:
india_data['title_comments_body'] = india_data['title_comments'] + '  ' + india_data['body']

In [27]:
X = india_data.title_comments_body
y = india_data.flair
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


accuracy 0.775
                  precision    recall  f1-score   support

   Non-Political       0.73      0.67      0.70        24
       Scheduled       0.75      0.50      0.60        18
     Photography       1.00      0.92      0.96        12
        Politics       0.88      0.91      0.89        23
Business/Finance       0.57      0.67      0.62        12
  Policy/Economy       0.95      0.84      0.89        25
          Sports       0.59      0.70      0.64        23
            Food       0.74      0.85      0.79        20
        AskIndia       0.60      0.80      0.69        15
     Coronavirus       0.96      0.86      0.91        28

        accuracy                           0.78       200
       macro avg       0.78      0.77      0.77       200
    weighted avg       0.79      0.78      0.78       200



In [28]:
india_data_impute_body = india_data.apply(lambda x: x.fillna(x.value_counts().index[0]))
india_data_impute_body['title_comments_body'] = india_data_impute_body['title'] + '  ' + india_data_impute_body['comments'] + '  ' + india_data_impute_body['body']
X = india_data_impute_body.title_comments_body
y = india_data_impute_body.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.805
                  precision    recall  f1-score   support

   Non-Political       0.71      0.81      0.76        21
       Scheduled       0.89      0.89      0.89        18
     Photography       0.95      0.95      0.95        22
        Politics       0.87      0.87      0.87        23
Business/Finance       0.82      0.74      0.78        19
  Policy/Economy       0.94      0.79      0.86        19
          Sports       0.74      0.58      0.65        24
            Food       0.76      0.76      0.76        21
        AskIndia       0.73      0.83      0.78        23
     Coronavirus       0.64      0.90      0.75        10

        accuracy                           0.81       200
       macro avg       0.81      0.81      0.80       200
    weighted avg       0.81      0.81      0.80       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments + Body + URL) </h3>

In [29]:
india_data['title_comments_body_url'] = india_data['title_comments_body'] + ' ' + india_data['url']

In [30]:
X = india_data.title_comments_body_url
y = india_data.flair
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


accuracy 0.785
                  precision    recall  f1-score   support

   Non-Political       0.73      0.70      0.72        27
       Scheduled       0.71      0.58      0.64        26
     Photography       0.83      0.88      0.86        17
        Politics       0.92      0.88      0.90        25
Business/Finance       1.00      0.90      0.95        10
  Policy/Economy       1.00      0.95      0.97        20
          Sports       0.56      0.70      0.62        20
            Food       0.52      0.80      0.63        15
        AskIndia       0.83      0.79      0.81        19
     Coronavirus       1.00      0.81      0.89        21

        accuracy                           0.79       200
       macro avg       0.81      0.80      0.80       200
    weighted avg       0.81      0.79      0.79       200



In [31]:
india_data_impute_body = india_data.apply(lambda x: x.fillna(x.value_counts().index[0]))
india_data_impute_body['title_comments_body_url'] = india_data_impute_body['title'] + '  ' + india_data_impute_body['comments'] + '  ' + india_data_impute_body['body'] + '  ' + india_data_impute_body['url']
X = india_data_impute_body.title_comments_body_url
y = india_data_impute_body.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


accuracy 0.81
                  precision    recall  f1-score   support

   Non-Political       0.84      0.76      0.80        21
       Scheduled       0.64      0.70      0.67        20
     Photography       0.85      0.94      0.89        18
        Politics       0.92      0.75      0.83        16
Business/Finance       1.00      0.86      0.93        22
  Policy/Economy       0.85      0.85      0.85        26
          Sports       0.78      0.74      0.76        19
            Food       0.72      0.82      0.77        22
        AskIndia       0.78      0.82      0.80        17
     Coronavirus       0.80      0.84      0.82        19

        accuracy                           0.81       200
       macro avg       0.82      0.81      0.81       200
    weighted avg       0.82      0.81      0.81       200



<h3>Logistic Regression Model (Title + Body) </h3>

In [32]:
india_data['title_body'] = india_data['title'] + '  ' + india_data['body']

In [33]:
X = india_data.title_body
y = india_data.flair
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.815
                  precision    recall  f1-score   support

   Non-Political       1.00      1.00      1.00        18
       Scheduled       0.64      0.84      0.73        19
     Photography       0.95      0.86      0.90        21
        Politics       0.95      0.87      0.91        23
Business/Finance       0.77      0.95      0.85        21
  Policy/Economy       0.94      0.89      0.92        19
          Sports       0.75      0.50      0.60        24
            Food       0.61      0.88      0.72        16
        AskIndia       0.75      0.63      0.69        19
     Coronavirus       0.89      0.80      0.84        20

        accuracy                           0.81       200
       macro avg       0.83      0.82      0.82       200
    weighted avg       0.83      0.81      0.81       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + URL) </h3>

In [34]:
india_data['title_url'] = india_data['title'] + '  ' + india_data['url']

In [35]:
X = india_data.title_url
y = india_data.flair
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.75
                  precision    recall  f1-score   support

   Non-Political       0.82      0.78      0.80        23
       Scheduled       0.75      0.41      0.53        22
     Photography       0.96      1.00      0.98        24
        Politics       0.94      0.81      0.87        21
Business/Finance       0.89      1.00      0.94        17
  Policy/Economy       0.81      0.71      0.76        24
          Sports       0.52      0.68      0.59        19
            Food       0.63      0.71      0.67        17
        AskIndia       0.54      0.72      0.62        18
     Coronavirus       0.67      0.67      0.67        15

        accuracy                           0.75       200
       macro avg       0.75      0.75      0.74       200
    weighted avg       0.77      0.75      0.75       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments + URL) </h3>

In [36]:
india_data['title_comments_url'] = india_data['title'] + '  ' + india_data['comments'] + '  ' + india_data['url']

In [37]:
X = india_data.title_comments_url
y = india_data.flair
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.73
                  precision    recall  f1-score   support

   Non-Political       0.67      0.74      0.70        19
       Scheduled       0.69      0.50      0.58        18
     Photography       0.81      1.00      0.89        17
        Politics       0.67      0.86      0.75        14
Business/Finance       1.00      0.83      0.90        23
  Policy/Economy       0.76      0.81      0.79        16
          Sports       0.62      0.75      0.68        20
            Food       0.68      0.50      0.58        26
        AskIndia       0.65      0.61      0.63        18
     Coronavirus       0.74      0.79      0.77        29

        accuracy                           0.73       200
       macro avg       0.73      0.74      0.73       200
    weighted avg       0.73      0.73      0.73       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Body + URL) </h3>

In [38]:
india_data['title_body_url'] = india_data['title'] + '  ' + india_data['body'] + '  ' + india_data['url']

In [39]:
X = india_data.title_body_url
y = india_data.flair
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.805
                  precision    recall  f1-score   support

   Non-Political       0.94      0.81      0.87        21
       Scheduled       0.68      0.75      0.71        20
     Photography       0.91      0.87      0.89        23
        Politics       0.85      0.89      0.87        19
Business/Finance       0.88      0.88      0.88        25
  Policy/Economy       0.88      0.88      0.88        17
          Sports       0.73      0.73      0.73        22
            Food       0.67      0.82      0.74        17
        AskIndia       0.63      0.60      0.62        20
     Coronavirus       0.93      0.81      0.87        16

        accuracy                           0.81       200
       macro avg       0.81      0.80      0.81       200
    weighted avg       0.81      0.81      0.81       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Comments + Body + URL) </h3>

In [40]:
india_data['comments_body_url'] = india_data['comments'] + '  ' + india_data['body'] + '  ' + india_data['url']

In [41]:
X = india_data.title_body_url
y = india_data.flair
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.79
                  precision    recall  f1-score   support

   Non-Political       0.90      0.86      0.88        21
       Scheduled       0.81      0.71      0.76        24
     Photography       0.88      0.88      0.88        24
        Politics       0.93      0.88      0.90        16
Business/Finance       1.00      0.96      0.98        24
  Policy/Economy       0.80      0.71      0.75        17
          Sports       0.48      0.65      0.55        17
            Food       0.60      0.80      0.69        15
        AskIndia       0.64      0.67      0.65        21
     Coronavirus       0.94      0.76      0.84        21

        accuracy                           0.79       200
       macro avg       0.80      0.79      0.79       200
    weighted avg       0.81      0.79      0.80       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<p><b>
The best results are seen when title, comments, body and url (with imputed values) are all used as a combined feature to predict the flair with an accuracy of 77-81%. 
</b></p>